Abstract

Statistical models of evolution are algebraic varieties in the space of joint probability distri- 
butions on the leaf colorations of a phylogenetic tree. The phylogenetic invariants of a model 
are the polynomials which vanish on the variety. Several widely used models for biological se- 
quences have transition matrices that can be diagonalized by means of the Fourier transform of 
an abelian group. Their phylogenetic invariants form a toric ideal in the Fourier coordinates. 
We determine generators and Gröbner bases for these toric ideals. For the Jukes-Cantor and
Kimura models on a binary tree, our Gröbner bases consist of certain explicitly constructed
polynomials of degree at most four. 

1 Introduction 

Cavender-Felsenstein [3] and Lake [13] introduced phylogenetic invariants as an algebraic tool for
reconstructing evolutionary trees from biological sequence data. Such invariants exist for any
tree-based Markov model, and they uniquely characterize that model. While partial lists of invariants
have been described for various models [3], [7], the literature still conveys a sense that
phylogenetic invariants and algebraic algorithms for computing them are not useful for any problem
whose size is of biological interest. In his book Inferring Phylogenies, Felsenstein sums this up from
the perspective of molecular biology as follows: "... invariants are worth attention, not for what
they do for us now, but what they might lead to in the future..." (page 390). A similar tone is
expressed in the final section of the book Phylogenetics by Semple and Steel (page 212).

But the future is closer than readers of these two excellent books might surmise. For the general 
Markov model, considerable progress has been made in the recent work of Allman and Rhodes [1, 2].
A determinantal representation of the Allman-Rhodes ideal in the binary case has been proposed
in [14, §5]. The present paper is not concerned with the general Markov model but with a class
of special models, namely, the group-based models [16, §8.10]. The problem of finding invariants
for these models was studied by several authors including Evans-Speed, Evans-Zhou,
Steel-Fu [17] and Székely-Steel-Erdős [20]. The class of group-based models includes the Jukes-Cantor
model, for either binary or DNA sequences, and the Kimura models, with two or three parameters.

The contribution of this paper is an explicit description of generators and a Gröbner basis for
the ideal of phylogenetic invariants of such a model. The key idea can be summarized as follows. 



Theorem 1. For any group based model on a phylogenetic tree T, the prime ideal of phylogenetic 
invariants is generated by the invariants of the local submodels around each interior node of T, 
together with the quadrics which encode conditional independence statements along the splits of T. 

The precise form of this theorem and its proof are given in Section 4. We continue by reviewing
the evolutionary models to be considered here. Let $T$ be a rooted tree with $m$ leaves. Let $V(T)$
denote the set of nodes of $T$. To each node $v \in V(T)$ we associate a $k$-ary random variable $X_v$. In
biology, the most common values of $k$ are 2, 4, and 20. Consider the probability $P(X_v = i)$ that
$X_v$ is in state $i$. For DNA sequences this probability represents the proportion of characters in the
sequence at $v$ which are a particular nucleotide, namely A, G, C or T.

The relationship between the random variables $X_v$ is encoded by the structure of the tree. Let
$\pi$ be a distribution of the random variable $X_r$ at the root node $r$. For each node $v \in V(T) \setminus \{r\}$, let
$a(v)$ be the unique parent of $v$. The transition from $a(v)$ to $v$ is given by a $k \times k$-matrix $A^{(v)}$ of
probabilities. Then the probability distribution at each node is computed recursively by the rule

$$P(X_v = j) \;=\; \sum_{i=1}^{k} A^{(v)}_{ij} \cdot P(X_{a(v)} = i). \tag{1}$$

This rule induces a joint distribution on all the random variables $X_v$. We label the leaves of $T$ by
$1, 2, \ldots, m$, and we abbreviate the marginal distribution on the variables at the leaves as follows:

$$p_{i_1 i_2 \ldots i_m} \;=\; P(X_1 = i_1, X_2 = i_2, \ldots, X_m = i_m). \tag{2}$$

In biological applications, one estimates (some of) these $k^m$ probabilities from $m$ aligned sequences
on $k$ letters, and the aim is to reconstruct the tree. The root distribution $\pi$ and the transition
matrices $A^{(v)}$ are typically unknown. In the general Markov model, each matrix entry $A^{(v)}_{ij}$
is an independent model parameter. For the group-based models, to be studied in this paper, the
number of model parameters is smaller because some of the entries of $A^{(v)}$ are assumed to coincide.
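To make rule (1) and the leaf marginal (2) concrete, here is a minimal brute-force sketch in Python. The function name `leaf_distribution`, the three-leaf tree and the random parameters are illustrative assumptions, not part of the paper; the sketch simply sums the product of transition probabilities over all states of all nodes.

```python
import itertools
import numpy as np

def leaf_distribution(parent, leaves, root, pi, A, k):
    """Joint distribution of the leaf states of a tree Markov model.

    parent: dict mapping each non-root node v to its parent a(v)
    A:      dict mapping each non-root node v to a k x k matrix with
            A[v][i, j] = P(X_v = j | X_a(v) = i)
    """
    nodes = [root] + list(parent)
    p = np.zeros((k,) * len(leaves))
    # Enumerate every assignment of states to all nodes and accumulate
    # its probability in the cell indexed by the leaf states.
    for states in itertools.product(range(k), repeat=len(nodes)):
        x = dict(zip(nodes, states))
        weight = pi[x[root]]
        for v, a in parent.items():
            weight *= A[v][x[a], x[v]]
        p[tuple(x[leaf] for leaf in leaves)] += weight
    return p

# Hypothetical tree: the root r has a leaf 3 and an interior child v with leaves 1 and 2.
parent = {"v": "r", 1: "v", 2: "v", 3: "r"}
rng = np.random.default_rng(0)
k = 2
pi = rng.dirichlet(np.ones(k))                              # random root distribution
A = {v: rng.dirichlet(np.ones(k), size=k) for v in parent}  # random row-stochastic matrices
p = leaf_distribution(parent, [1, 2, 3], "r", pi, A, k)
```

The enumeration is exponential in the number of nodes, so it is only meant for checking small examples by hand.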

A phylogenetic invariant of the model is a polynomial in the leaf probabilities $p_{i_1 i_2 \cdots i_m}$ which
vanishes for every choice of model parameters. The set of these polynomials forms a prime ideal in
the polynomial ring over the unknowns $p_{i_1 i_2 \cdots i_m}$. Our objective is to compute this ideal as explicitly
as possible. In the language of algebraic geometry, we seek to determine the variety parameterized
by the rational map induced by the joint distribution on the leaves. The study of such varieties for
various statistical models is a central theme in the emerging field of algebraic statistics.

In this paper, we determine the ideal of invariants for models whose structure is governed by an 
abelian group. Four models used in computational biology have this structure: the Jukes-Cantor 
models and the Kimura models. Theorem 2 below summarizes our results for these models. 

The Jukes-Cantor binary model on two letters ($k = 2$) is the model with transition matrices

$$A^{(v)} \;=\; \begin{pmatrix} 1 - a_v & a_v \\ a_v & 1 - a_v \end{pmatrix},$$

where $a_v$ is the probability of making a transition between the states along the edge from $a(v)$ to $v$.
The Kimura 3-parameter model on $k = 4$ letters (for DNA sequences) has the transition matrices

$$A^{(v)} \;=\; \begin{pmatrix}
1 - a_v - b_v - c_v & a_v & b_v & c_v \\
a_v & 1 - a_v - b_v - c_v & c_v & b_v \\
b_v & c_v & 1 - a_v - b_v - c_v & a_v \\
c_v & b_v & a_v & 1 - a_v - b_v - c_v
\end{pmatrix},$$

where $a_v$ is the probability of a transition and $b_v$ and $c_v$ are the transversion probabilities. The
Kimura 2-parameter model arises as the subvariety defined by taking $b_v = c_v$ for all $v$ and the
Jukes-Cantor DNA model is the subvariety defined by setting $a_v = b_v = c_v$ for all $v$.
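The common structure of these matrices is that the entry in row $i$ and column $j$ depends only on the difference of the two states in $\mathbb{Z}_2 \times \mathbb{Z}_2$. The following Python sketch illustrates this under one assumed encoding: the four states are numbered 0 to 3 by their two bits, so that group addition becomes bitwise XOR. The function name `kimura3` and the numerical parameter values are chosen only for illustration.

```python
import numpy as np

def kimura3(a, b, c):
    """4 x 4 Kimura 3-parameter matrix with states (0,0), (0,1), (1,0), (1,1) encoded as 0..3."""
    f = [1 - a - b - c, a, b, c]   # f[g] for the group element g
    # In Z2 x Z2 the difference g_i - g_j equals the sum, i.e. the XOR of the encodings.
    return np.array([[f[i ^ j] for j in range(4)] for i in range(4)])

K3 = kimura3(0.10, 0.03, 0.02)
K2 = kimura3(0.10, 0.03, 0.03)   # Kimura 2-parameter model: b = c
JC = kimura3(0.05, 0.05, 0.05)   # Jukes-Cantor DNA model: a = b = c
```

Each matrix is symmetric with rows summing to one, and the two specializations are obtained just by identifying parameters.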

Evans and Speed introduced a linear change of coordinates which diagonalizes these models.
In Section 2 we review their Fourier transform at the level of generality proposed by
Székely-Steel-Erdős [20]. The key idea is to label the states of the random variables $X_v$ by a finite abelian group
(e.g. $\mathbb{Z}_2 = \{0, 1\}$ or $\mathbb{Z}_2 \times \mathbb{Z}_2 = \{A, G, C, T\}$) in such a way that the probability of transitioning
from $g_i$ to $g_j$ depends only on the difference $g_i - g_j$. Replacing the original coordinates $p_{i_1 \cdots i_m}$
by Fourier coordinates $q_{i_1 \cdots i_m}$, the ideal of phylogenetic invariants becomes a toric ideal. Recall
that a toric ideal is a prime ideal generated by differences of monomials.

As an example consider the Jukes-Cantor binary model for $m = 4$. The Fourier coordinates are

$$q_{ijkl} \;=\; \sum_{r=0}^{1} \sum_{s=0}^{1} \sum_{t=0}^{1} \sum_{u=0}^{1} (-1)^{ir + js + kt + lu} \, p_{rstu} \qquad \text{where } i, j, k, l \in \mathbb{Z}_2. \tag{3}$$

If $T$ is the balanced binary tree of height two, then this model has the parametric representation

$$q_{ijkl} \;\mapsto\; a_i \cdot b_j \cdot c_k \cdot d_l \cdot e_{i+j} \cdot f_{k+l} \cdot g_{i+j+k+l}. \tag{4}$$

Disregarding the trivial invariant $q_{0000} - 1$, the toric ideal of phylogenetic invariants is generated
by 20 linearly independent quadrics. These arise as the $2 \times 2$-minors of the four $2 \times 4$-matrices

$$\begin{pmatrix} q_{0i00} & q_{0i01} & q_{0i10} & q_{0i11} \\ q_{1(1+i)00} & q_{1(1+i)01} & q_{1(1+i)10} & q_{1(1+i)11} \end{pmatrix}, \qquad
\begin{pmatrix} q_{00i0} & q_{01i0} & q_{10i0} & q_{11i0} \\ q_{00(1+i)1} & q_{01(1+i)1} & q_{10(1+i)1} & q_{11(1+i)1} \end{pmatrix} \qquad \text{for } i \in \{0, 1\}. \tag{5}$$

Moreover, these quadrics form a Gröbner basis for a suitable term order. This generalizes as follows:
Theorem 2. Let $T$ be an arbitrary binary rooted tree. Modulo the trivial invariant $q_{00\cdots0} - 1$,

(a) the ideal of the Jukes-Cantor binary model is generated by polynomials of degree 2,

(b) the ideal of the Jukes-Cantor DNA model is generated by polynomials of degree 1, 2 and 3,

(c) the ideal of the Kimura 2-parameter model is generated by polynomials of degree 1, 2, 3 and 4,

(d) the ideal of the Kimura 3-parameter model is generated by polynomials of degree 2, 3 and 4.
Each of these generating sets has an explicit combinatorial description and it is a Gröbner basis.
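For the $m = 4$ example preceding Theorem 2, one can check numerically that the $2 \times 2$-minors of the four $2 \times 4$-matrices vanish on the monomial parameterization of the balanced binary tree. The sketch below is only a sanity check under assumed random parameter values; the helper names `q` and `minors_2x2` are hypothetical, and the code does not test linear independence or the Gröbner basis property.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# Random values for the parameters a_i, b_j, c_k, d_l, e_{i+j}, f_{k+l}, g_{i+j+k+l}.
a, b, c, d, e, f, g = (rng.uniform(0.5, 1.5, size=2) for _ in range(7))

def q(i, j, k, l):
    """Monomial parameterization of the Fourier coordinate q_{ijkl} (indices in Z2)."""
    return (a[i] * b[j] * c[k] * d[l]
            * e[(i + j) % 2] * f[(k + l) % 2] * g[(i + j + k + l) % 2])

def minors_2x2(matrix):
    """All 2 x 2 minors of a 2 x n matrix given as a pair of rows."""
    top, bottom = matrix
    return [top[s] * bottom[t] - top[t] * bottom[s]
            for s, t in itertools.combinations(range(len(top)), 2)]

pairs = list(itertools.product(range(2), repeat=2))
quadrics = []
for i in range(2):
    # In Z2 the index 1 + i equals 1 - i.
    first = [[q(0, i, k, l) for k, l in pairs],
             [q(1, 1 - i, k, l) for k, l in pairs]]
    second = [[q(r, s, i, 0) for r, s in pairs],
              [q(r, s, 1 - i, 1) for r, s in pairs]]
    quadrics += minors_2x2(first) + minors_2x2(second)

largest = max(abs(x) for x in quadrics)   # should be zero up to rounding error
```

The four matrices have 24 minors in total, each of which evaluates to zero up to floating-point error.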






The outline for the paper is as follows. In Section 2 we review the Fourier transform technique
introduced by Evans and Speed [1] for diagonalizing group-based models. This is done for arbitrary
finite abelian groups, as in [20], and it reduces our problem to computing the kernel of a monomial
map as in (12). Section 3 turns rooted trees on $m$ leaves into unrooted trees on $m + 1$ leaves, and
it introduces "friendly labelings" on abelian groups. These labelings are used to classify the linear
model invariants, and to set up a coordinate system modulo the linear invariants. This generalizes
the construction in [17]. In Section 4 we state and prove the precise form of Theorem 1. It reduces
the construction of invariants to the case of claw trees $K_{1,m}$. This case is studied in Section 5.

Section 6 is aimed at computational biologists interested in experimenting with our invariants.
Theorem 2 is derived by describing the generating sets explicitly. Section 7 concerns the question
whether phylogenetics really needs all of the many invariants in our generating sets. We argue that
the answer is affirmative. We demonstrate by means of an example that algebraically independent
invariants do not suffice to characterize an evolutionary model, in contrast to what has been suggested
in the literature. Conclusions, algorithmic questions and open problems appear in Section 8.

2 A Linear Change of Coordinates 

The Fourier transform provides a linear change of coordinates that transforms the irreducible variety
of distributions of a group-based model into a toric variety. Our presentation in this section is an
exposition of the constructions of Evans-Speed and of [20]. Experts in combinatorial commutative algebra
will be surprised to encounter toric ideals whose natural coordinate system is the wrong one: the
equations in the given coordinates $p_{i_1 i_2 \ldots i_m}$ are very far from binomial, and the task at hand is to
find new coordinates $q_{i_1 i_2 \ldots i_m}$ so that the equations become binomials.

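Example 3. The Jukes-Cantor binary model for the rooted claw tree $K_{1,3}$ has the parameterization

$$\begin{aligned}
p_{000} &= \pi_0 \alpha_0 \beta_0 \gamma_0 + \pi_1 \alpha_1 \beta_1 \gamma_1, &\qquad p_{001} &= \pi_0 \alpha_0 \beta_0 \gamma_1 + \pi_1 \alpha_1 \beta_1 \gamma_0, \\
p_{010} &= \pi_0 \alpha_0 \beta_1 \gamma_0 + \pi_1 \alpha_1 \beta_0 \gamma_1, &\qquad p_{011} &= \pi_0 \alpha_0 \beta_1 \gamma_1 + \pi_1 \alpha_1 \beta_0 \gamma_0, \\
p_{100} &= \pi_0 \alpha_1 \beta_0 \gamma_0 + \pi_1 \alpha_0 \beta_1 \gamma_1, &\qquad p_{101} &= \pi_0 \alpha_1 \beta_0 \gamma_1 + \pi_1 \alpha_0 \beta_1 \gamma_0, \\
p_{110} &= \pi_0 \alpha_1 \beta_1 \gamma_0 + \pi_1 \alpha_0 \beta_0 \gamma_1, &\qquad p_{111} &= \pi_0 \alpha_1 \beta_1 \gamma_1 + \pi_1 \alpha_0 \beta_0 \gamma_0.
\end{aligned}$$

The Fourier transform gives a linear change of coordinates in the parameter space,

$$\begin{aligned}
\pi_0 &= \tfrac{1}{2}(r_0 + r_1), & \pi_1 &= \tfrac{1}{2}(r_0 - r_1), & \alpha_0 &= \tfrac{1}{2}(a_0 + a_1), & \alpha_1 &= \tfrac{1}{2}(a_0 - a_1), \\
\beta_0 &= \tfrac{1}{2}(b_0 + b_1), & \beta_1 &= \tfrac{1}{2}(b_0 - b_1), & \gamma_0 &= \tfrac{1}{2}(c_0 + c_1), & \gamma_1 &= \tfrac{1}{2}(c_0 - c_1),
\end{aligned}$$

and it simultaneously gives a linear change of coordinates in the probability space:

$$q_{ijk} \;=\; \sum_{r=0}^{1} \sum_{s=0}^{1} \sum_{t=0}^{1} (-1)^{ir + js + kt} \, p_{rst}. \tag{6}$$

After these coordinate changes, our model is given by the monomial parameterization:

$$\begin{aligned}
q_{000} &= r_0 a_0 b_0 c_0, & q_{001} &= r_1 a_0 b_0 c_1, & q_{010} &= r_1 a_0 b_1 c_0, & q_{011} &= r_0 a_0 b_1 c_1, \\
q_{100} &= r_1 a_1 b_0 c_0, & q_{101} &= r_0 a_1 b_0 c_1, & q_{110} &= r_0 a_1 b_1 c_0, & q_{111} &= r_1 a_1 b_1 c_1.
\end{aligned}$$

The toric ideal of algebraic relations among these monomials has the following Gröbner basis:

$$\{\, q_{001} q_{110} - q_{000} q_{111}, \;\; q_{010} q_{101} - q_{000} q_{111}, \;\; q_{100} q_{011} - q_{000} q_{111} \,\}.$$

The inverse to (6) now translates each of the three binomials into a quadric with eight terms, e.g.,

$$p_{001} p_{010} + p_{001} p_{100} - p_{000} p_{011} - p_{000} p_{101} + p_{100} p_{111} - p_{101} p_{110} + p_{010} p_{111} - p_{011} p_{110}.$$

These three eight-term quadrics generate the ideal of phylogenetic invariants for this model. □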
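Example 3 is small enough to verify numerically. The Python sketch below draws random parameters (an assumption made purely for testing), builds the eight probabilities, applies the Fourier transform, and evaluates the three binomials together with the eight-term quadric; all of them should vanish up to rounding error.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
# Random root distribution and random flip probabilities along the three edges.
pi = rng.dirichlet(np.ones(2))
alpha, beta, gamma = (np.array([1 - t, t]) for t in rng.uniform(0, 0.5, size=3))

bits = list(itertools.product(range(2), repeat=3))
# p_{ijk} = sum_h pi_h * alpha_{i+h} * beta_{j+h} * gamma_{k+h}, with indices added mod 2.
p = {(i, j, k): sum(pi[h] * alpha[i ^ h] * beta[j ^ h] * gamma[k ^ h] for h in range(2))
     for i, j, k in bits}

# Fourier coordinates: q_{ijk} = sum_{r,s,t} (-1)^{ir+js+kt} p_{rst}.
q = {(i, j, k): sum((-1) ** (i * r + j * s + k * t) * p[r, s, t] for r, s, t in bits)
     for i, j, k in bits}

binomials = [q[0, 0, 1] * q[1, 1, 0] - q[0, 0, 0] * q[1, 1, 1],
             q[0, 1, 0] * q[1, 0, 1] - q[0, 0, 0] * q[1, 1, 1],
             q[1, 0, 0] * q[0, 1, 1] - q[0, 0, 0] * q[1, 1, 1]]

quadric = (p[0, 0, 1] * p[0, 1, 0] + p[0, 0, 1] * p[1, 0, 0] - p[0, 0, 0] * p[0, 1, 1]
           - p[0, 0, 0] * p[1, 0, 1] + p[1, 0, 0] * p[1, 1, 1] - p[1, 0, 1] * p[1, 1, 0]
           + p[0, 1, 0] * p[1, 1, 1] - p[0, 1, 1] * p[1, 1, 0])
```

Expanding the quadric in the Fourier coordinates shows that it equals one quarter of the first binomial, which is why both vanish together.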
Recall from (1) and (2) that the joint distribution of a Markov model on a tree $T$ has the form

$$p_{i_1 \ldots i_m} \;=\; \sum \pi_{i_r} \prod_{v \in V(T) \setminus \{r\}} A^{(v)}_{i_{a(v)} i_v}, \tag{7}$$

where the sum is over all states of the interior nodes of the tree $T$. Here we are concerned with the
case when the states of the random variables are the elements of a finite additive abelian group $G$,
and the transition matrix entry $A^{(v)}_{g_{a(v)} g_v}$ depends only on the difference of $g_{a(v)}$ and $g_v$ in $G$. We
denote this entry by $f^{(v)}(g_{a(v)} - g_v)$. Hence group based models of evolution have the form

$$p_{g_1, \ldots, g_m} \;=\; p(g_1, \ldots, g_m) \;=\; \sum \pi(g_r) \prod_{v \in V(T) \setminus \{r\}} f^{(v)}(g_{a(v)} - g_v). \tag{8}$$

The right hand side is a polynomial of degree equal to the number of edges of $T$ plus one, and the
number of terms of this polynomial is $k$ raised to the number of interior nodes of $T$. Our aim is to
perform a linear change of coordinates so that this big polynomial becomes a monomial.

The dual group to $G$ (or character group of $G$) is the group of all group homomorphisms from
$G$ into the multiplicative group of complex numbers. It is denoted by $\hat{G} = \mathrm{Hom}(G, \mathbb{C}^\times)$. The
elements of $\hat{G}$ are the characters of $G$ and a typical element of $\hat{G}$ is denoted by the letter $\chi$.

Given any function $f : G \to \mathbb{C}$, the Fourier transform $\hat{f}$ is the function $\hat{f} : \hat{G} \to \mathbb{C}$ defined by

$$\hat{f}(\chi) \;=\; \sum_{g \in G} \chi(g) f(g).$$

Given two functions $f_1$ and $f_2$ on $G$, their convolution $f_1 * f_2$ is the new function defined by

$$(f_1 * f_2)(g) \;=\; \sum_{h \in G} f_1(h) f_2(g - h).$$

The following facts about the dual group and the Fourier transform are well-known. 

Lemma 4. Let $f_1, f_2$ be functions from a finite abelian group $G$ to $\mathbb{C}$ and $1$ the constant function.

(a) The group $G$ and the dual group $\hat{G}$ are isomorphic as abstract groups.

(b) Fourier transform turns convolution into multiplication, i.e., $\widehat{f_1 * f_2} = \hat{f}_1 \cdot \hat{f}_2$, and

(c) $\hat{1}(\chi) = |G|$ if $\chi = 1$ (the unit in $\hat{G}$), and $\hat{1}(\chi) = 0$ otherwise.
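Parts (b) and (c) of Lemma 4 can be checked numerically for a cyclic group. The sketch below assumes $G = \mathbb{Z}_5$ (chosen only for illustration) and indexes the characters by $j$ via $\chi_j(g) = e^{2\pi i j g / n}$; the function name `fourier` is a hypothetical helper.

```python
import numpy as np

n = 5                                    # G = Z_5, an arbitrary illustrative choice
rng = np.random.default_rng(3)
f1, f2 = rng.normal(size=n), rng.normal(size=n)

def fourier(f):
    """hat f(chi_j) = sum_g chi_j(g) f(g), where chi_j(g) = exp(2*pi*i*j*g/n)."""
    g = np.arange(n)
    return np.array([np.sum(np.exp(2j * np.pi * j * g / n) * f) for j in range(n)])

# (f1 * f2)(g) = sum_h f1(h) f2(g - h), with g - h computed in Z_n.
conv = np.array([sum(f1[h] * f2[(g - h) % n] for h in range(n)) for g in range(n)])

lhs = fourier(conv)                      # transform of the convolution
rhs = fourier(f1) * fourier(f2)          # product of the transforms
ones_hat = fourier(np.ones(n))           # transform of the constant function 1
```

Here `lhs` and `rhs` agree up to rounding, and `ones_hat` is $|G|$ at the trivial character and zero elsewhere.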

The main theorem about discrete Fourier analysis and group based models is that the Fourier
transform of the joint distribution has a parameterization that can be written in product form.
Hence, a phylogenetic model with group structure is a toric variety. Note that if the abelian group
for the group model is any group other than $\mathbb{Z}_2^n$, then the coordinate transformation requires the
complex numbers. Before we state the general result, we illustrate the idea with a small example.

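Example 5 (Fourier transform for the simplest trees). Let $T = K_{1,m}$ be the tree whose only
nodes are the $m$ leaves and the root. The joint probability of a group based model is given by

$$p(g_1, g_2, \ldots, g_m) \;=\; \sum_{h \in G} \pi(h) \prod_{i=1}^{m} f^{(i)}(h - g_i). \tag{9}$$

We will take the Fourier transform of this probability density with respect to the group $G^m$. To
do this, we replace the root distribution $\pi : G \to \mathbb{R}$ by a new function $\tilde{\pi} : G^m \to \mathbb{R}$ as follows:

$$\tilde{\pi}(h_1, \ldots, h_m) \;=\; \begin{cases} \pi(h_1) & \text{if } h_1 = h_2 = \cdots = h_m, \\ 0 & \text{otherwise.} \end{cases}$$

Then we have

$$p(g_1, g_2, \ldots, g_m) \;=\; \sum_{(h_1, \ldots, h_m) \in G^m} \tilde{\pi}(h_1, \ldots, h_m) \prod_{i=1}^{m} f^{(i)}(h_i - g_i).$$

Thus $p$ is the convolution of two functions on $G^m$. Taking the Fourier transform yields

$$q(\chi_1, \ldots, \chi_m) \;=\; \hat{\tilde{\pi}}(\chi_1, \ldots, \chi_m) \prod_{i=1}^{m} \hat{f}^{(i)}(\chi_i)$$

by the convolution formula and the independence of the $f^{(i)}$ in the Fourier transform. Furthermore,

$$\begin{aligned}
\hat{\tilde{\pi}}(\chi_1, \ldots, \chi_m) &= \sum_{(g_1, \ldots, g_m) \in G^m} \langle (\chi_1, \ldots, \chi_m), (g_1, \ldots, g_m) \rangle \cdot \tilde{\pi}(g_1, \ldots, g_m) \\
&= \sum_{g \in G} \langle \chi_1 \chi_2 \cdots \chi_m, g \rangle \cdot \pi(g) \;=\; \hat{\pi}(\chi_1 \chi_2 \cdots \chi_m),
\end{aligned}$$

and hence

$$q(\chi_1, \ldots, \chi_m) \;=\; \hat{\pi}(\chi_1 \cdots \chi_m) \prod_{i=1}^{m} \hat{f}^{(i)}(\chi_i). \tag{10}$$

Example 5 is the base case in the induction needed to prove the following general result.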

Theorem 6. (Evans-Speed [1]) Let $p(g_1, \ldots, g_m)$ be the joint distribution of a group based model
for the phylogenetic tree $T$, parametrized as in (8). Then the Fourier transform of $p$ has the form

$$q(\chi_1, \ldots, \chi_m) \;=\; \hat{\pi}(\chi_1 \cdots \chi_m) \cdot \prod_{v \in V(T) \setminus \{r\}} \hat{f}^{(v)}\Big( \prod_{l \in \Lambda(v)} \chi_l \Big) \tag{11}$$

where $\Lambda(v)$ is the set of leaves which have $v$ as a common ancestor.

We refer to the papers of Evans-Speed and Székely-Steel-Erdős [20] for the proof of Theorem 6. The transformation from (9) to (10) is a
special case of the transformation from (8) to (11). Formula (8) is a polynomial parameterization of
the evolutionary model, and formula (11) is a monomial parameterization of the same model. Since
$G$ and $\hat{G}$ are isomorphic groups, we can rewrite the monomial parameterization as follows:

$$q_{g_1, \ldots, g_m} \;\mapsto\; \pi(g_1 + \cdots + g_m) \cdot \prod_{v \in V(T) \setminus \{r\}} f^{(v)}\Big( \sum_{l \in \Lambda(v)} g_l \Big). \tag{12}$$

We regard this formula as a monomial map from a polynomial ring in unknowns

$$q_{g_1, \ldots, g_m} \;=\; q(g_1, \ldots, g_m)$$

to the polynomial ring in the (not necessarily distinct) unknowns $\pi(g)$ and $f^{(v)}(g)$, which are
indexed by nodes of $T$ and elements of $G$. Our aim is to determine the kernel of this map.

3 Edge Labelings and Linear Invariants 

In this section, we determine all linear forms $q_{g_1, \ldots, g_m} - q_{h_1, \ldots, h_m}$ in the kernel of our monomial map
and we set up a convenient coordinate system for working modulo these linear invariants.
Our construction is inspired by the work of Steel and Fu [17] on classifying the linear invariants.

We first add an extra edge at the root of $T$ to achieve a new tree with $m + 1$ leaves. To keep
notation simple, we denote the new tree also by $T$. Let $E(T)$ be its set of edges. We next associate
a set of parameters to each $e \in E(T)$ by "moving" the parameters from a given node to the edge
directly above it. Given an assignment of group elements $(g_1, \ldots, g_m)$ to the $m$ leaves of $T$, we get,
for each edge $e$ of $T$, an assignment of a group element $g(e)$ as follows:

$$g(e) \;=\; \sum_{v \in \Lambda(e)} g_v.$$

Here $\Lambda(e)$ is the set of leaves below $e$. With this notation, we have eliminated the special distinction
of the root distribution, and our monomial parameterization (12) can be rewritten as

$$q_{g_1 \ldots g_m} \;\mapsto\; \prod_{e \in E(T)} f^{(e)}(g(e)). \tag{13}$$

If the unknowns $f^{(e)}(g)$ are all distinct then there are no linear invariants. This happens in the
Jukes-Cantor binary model and in the Kimura 3-parameter model. However, in general, we allow
the possibility that $f^{(e)}(g) = f^{(e)}(g')$ for distinct group elements $g, g' \in G$. To deal with this issue,
we introduce labeling functions. Let $\mathcal{L}$ be a finite set of labels. A labeling function is any function

$$L : G \to \mathcal{L}$$

such that $f^{(e)}(g) = f^{(e)}(g')$ if and only if $L(g) = L(g')$. For the time being, we will assume that
the labeling function associated to each edge of the tree is the same for every edge. However, we
will show later that this assumption can be dropped in some special instances.
Given such a labeling function $L$ we can now write our monomial parameterization (13) in a
standard commutative algebra notation. For every edge $e$ of the tree $T$ and every label $l \in \mathcal{L}$ we
introduce an indeterminate $a^{(e)}_l$. These indeterminates are now distinct. The polynomial ring in
these unknowns with complex coefficients is denoted $\mathbb{C}[a^{(e)}_l]$. Similarly, $\mathbb{C}[q_{g_1 \ldots g_m}]$ is the polynomial
ring generated by the Fourier coordinates. We wish to study the ring homomorphism

$$\mathbb{C}[q_{g_1 \ldots g_m}] \to \mathbb{C}[a^{(e)}_l], \qquad q_{g_1 \ldots g_m} \;\mapsto\; \prod_{e \in E(T)} a^{(e)}_{L(g(e))}. \tag{14}$$

The kernel of this map is the toric ideal of phylogenetic invariants in the Fourier transform of the
probabilities. We denote this ideal by $I_{T,L}$, suppressing dependence on the group $G$. From this
description, we immediately can deduce the structure of the linear phylogenetic invariants.

Proposition 7 (Linear Invariants). The vector space of linear polynomials in the ideal $I_{T,L}$ is
spanned by all differences $q_{g_1 \ldots g_m} - q_{h_1 \ldots h_m}$ where $L(g(e)) = L(h(e))$ for all edges $e$ of $T$.

Proof. Since $I_{T,L}$ is a toric ideal, it has a vector space basis consisting of binomials. In
particular, the subspace of linear polynomials in $I_{T,L}$ is spanned by differences of unknowns
$q_{g_1 \ldots g_m} - q_{h_1 \ldots h_m}$. Such a difference lies in $I_{T,L}$ if and only if
$\prod_{e \in E(T)} a^{(e)}_{L(g(e))} = \prod_{e \in E(T)} a^{(e)}_{L(h(e))}$. Since the
unknowns $a^{(e)}_l$ are all distinct, this happens if and only if $L(g(e)) = L\big(\sum_{v \in \Lambda(e)} g_v\big)$ coincides with
$L(h(e)) = L\big(\sum_{v \in \Lambda(e)} h_v\big)$ for all $e \in E(T)$. □

We now introduce coordinates for the polynomial ring $\mathbb{C}[q_{g_1 \ldots g_m}]$ modulo the ideal generated by
the linear invariants in $I_{T,L}$. The labeling function $L : G \to \mathcal{L}$ induces the function

$$L^T : G^m \to \mathcal{L}^{E(T)}, \qquad (g_1, \ldots, g_m) \;\mapsto\; \big( L(g(e)) \big)_{e \in E(T)}. \tag{15}$$

Let $\mathrm{im}(L^T)$ denote the image of this map. We call $\mathrm{im}(L^T)$ the set of consistent labelings of the tree
$T$. For each $\lambda \in \mathrm{im}(L^T)$ we introduce a new unknown $q_\lambda$. These generate a new polynomial ring.
Proposition 7 implies that our monomial map (14) is the composition of the map

$$\mathbb{C}[q_{g_1 \ldots g_m} : (g_1, \ldots, g_m) \in G^m] \;\to\; \mathbb{C}[q_\lambda : \lambda \in \mathrm{im}(L^T)], \qquad q_{g_1 \ldots g_m} \;\mapsto\; q_{L^T(g_1, \ldots, g_m)},$$

and the following monomial map which has no linear forms in its kernel:

$$\mathbb{C}[q_\lambda : \lambda \in \mathrm{im}(L^T)] \;\to\; \mathbb{C}[a^{(e)}_l : e \in E(T),\ l \in \mathcal{L}], \qquad q_\lambda \;\mapsto\; \prod_{e \in E(T)} a^{(e)}_{\lambda(e)}. \tag{16}$$

Our objective is to determine the kernel of the monomial map (16). This kernel is the toric ideal
$I_{T,L}$ modulo linear invariants. We use the same symbol $I_{T,L}$ to denote the kernel of (16).
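The set of consistent labelings $\mathrm{im}(L^T)$ can be enumerated by brute force for small trees. The Python sketch below assumes $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ and describes the tree by listing, for each edge, the leaves below it; the function names and the two-leaf claw tree are illustrative assumptions, not notation from the paper.

```python
import itertools

G = list(itertools.product(range(2), repeat=2))   # the group Z2 x Z2

def add(g, h):
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)

def group_sum(elements):
    total = (0, 0)
    for g in elements:
        total = add(total, g)
    return total

def consistent_labelings(leaves_below, m, L):
    """Image of L^T: one label per edge, in the order of leaves_below.

    leaves_below[e] lists the (0-based) leaves below edge e.
    """
    image = set()
    for g in itertools.product(G, repeat=m):
        image.add(tuple(L(group_sum(g[v] for v in below)) for below in leaves_below))
    return image

def jukes_cantor(g):
    return 0 if g == (0, 0) else 1

def kimura2(g):
    return {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 2}[g]

# Rooted claw tree with two leaves plus the extra root edge: three edges in total.
claw_edges = [[0], [1], [0, 1]]
jc = consistent_labelings(claw_edges, 2, jukes_cantor)
k2 = consistent_labelings(claw_edges, 2, kimura2)
```

For this small tree the Jukes-Cantor labeling yields 5 consistent labelings and the Kimura 2-parameter labeling yields 10, so the ring in the unknowns $q_\lambda$ has far fewer variables than the 16 Fourier coordinates.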

Our main result, which will be stated and proved in the next section, is valid only for a certain 
subclass of labeling functions. These will be called the friendly labeling functions. Fortunately, all 
labeling functions which arise naturally in statistical models of evolution are friendly. 
Definition 8. Fix a labeling function $L : G \to \mathcal{L}$ on the group $G$. For $m \geq 3$ consider the set

$$Z \;=\; \Big\{ (g_1, \ldots, g_m) \in G^m \;:\; \sum_{i=1}^{m-1} g_i = g_m \Big\}.$$

Consider the induced map $L : Z \subset G^m \to \mathcal{L}^m$ and denote by $\pi_i$ the projection $\pi_i : G^m \to G$ onto
the $i$-th coordinate. The function $L$ is called $m$-friendly if, for every $l = (l_1, \ldots, l_m) \in L(Z) \subseteq \mathcal{L}^m$,

$$\pi_i(L^{-1}(l)) \;=\; L^{-1}(l_i) \qquad \text{for all } i = 1, \ldots, m. \tag{17}$$

Note that the inclusion "$\subseteq$" always holds. But for most labeling functions it will be strict. Note
that $Z$ is the set of all allowable assignments of group elements to the edges of the unrooted tree
$T = K_{1,m}$. The definition of $m$-friendly guarantees that if a particular labeling $\lambda$ comes from an
assignment of group elements, then any choice of a group element to one particular edge $e$ which is
consistent with $\lambda$ at $e$ can be extended to an assignment that is consistent with $\lambda$ on all the edges
of $K_{1,m}$.

Example 9. Let $G = \mathbb{Z}_4$ and $\mathcal{L} = \{0, 1, 2\}$. Then the labeling function $L$ defined by

$$L(0) = 0, \quad L(1) = 1, \quad L(2) = L(3) = 2$$

is not 3-friendly because $L^{-1}(2) = \{2, 3\}$ strictly contains $\pi_3(L^{-1}((1, 1, 2))) = \pi_3(\{(1, 1, 2)\}) = \{2\}$.

The next example looks similar, but it is, in fact, much more friendly.

Example 10 (The Kimura 2-parameter labeling function). Let $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ and $\mathcal{L} =
\{0, 1, 2\}$. The Kimura 2-parameter model corresponds to the labeling function $L$ defined by

$$L((0, 0)) = 0, \quad L((0, 1)) = 1, \quad L((1, 0)) = L((1, 1)) = 2.$$

It can be checked by an explicit calculation that $L$ is 3-friendly.

We say that a labeling function $L : G \to \mathcal{L}$ is friendly if it is $m$-friendly for all $m \geq 3$.
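The explicit calculation mentioned in Example 10, and the failure in Example 9, can be reproduced with a direct implementation of condition (17). The following Python sketch is a brute-force check; the function name `is_m_friendly` and the dictionary encoding of the labeling functions are assumptions made for this illustration.

```python
import itertools

def is_m_friendly(elements, add, L, m):
    """Check condition (17) for a labeling function L (a dict) on a finite abelian group."""
    # Z = {(g_1, ..., g_m) : g_1 + ... + g_{m-1} = g_m}
    Z = []
    for head in itertools.product(elements, repeat=m - 1):
        total = head[0]
        for g in head[1:]:
            total = add(total, g)
        Z.append(head + (total,))
    # Group the points of Z by their label vector l = (L(g_1), ..., L(g_m)).
    fibers = {}
    for z in Z:
        fibers.setdefault(tuple(L[g] for g in z), []).append(z)
    # Compare the projection of each fiber with the full preimage of the label l_i.
    for labels, fiber in fibers.items():
        for i in range(m):
            projected = {z[i] for z in fiber}
            preimage = {g for g in elements if L[g] == labels[i]}
            if projected != preimage:
                return False
    return True

# Example 9: G = Z4.
Z4 = [0, 1, 2, 3]
L9 = {0: 0, 1: 1, 2: 2, 3: 2}
ex9 = is_m_friendly(Z4, lambda g, h: (g + h) % 4, L9, 3)

# Example 10: G = Z2 x Z2 with the Kimura 2-parameter labeling.
V4 = list(itertools.product(range(2), repeat=2))
add_v4 = lambda g, h: ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)
L10 = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 2}
ex10 = is_m_friendly(V4, add_v4, L10, 3)
```

Here `ex9` is `False` and `ex10` is `True`, in agreement with the two examples.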

Lemma 11. Labeling functions that are 3-friendly are friendly.

Proof. We will show that a labeling function that is 3-friendly and $m$-friendly is also $(m + 1)$-friendly.
Let $l \in L(Z)$. We will show that $\pi_{m+1}(L^{-1}(l)) = L^{-1}(l_{m+1})$. Let $l' = (L(g_1 +
g_2), L(g_3), \ldots, L(g_{m+1}))$ where $(g_1, \ldots, g_{m+1}) \in L^{-1}(l)$. Since $L$ is $m$-friendly, for every $h_{m+1} \in
L^{-1}(l_{m+1})$ there is an assignment of group elements $h' = (h'_2, h_3, \ldots, h_{m+1})$. Furthermore, $L$
is 3-friendly so there is some choice of group assignment $(h_1, h_2, h'_2)$ that realizes the labeling
$(L(g_1), L(g_2), L(g_1 + g_2))$. But then $h = (h_1, h_2, h_3, \ldots, h_{m+1})$ has $\pi_{m+1}(h) = h_{m+1}$ as desired. □

Lemma 11 says that checking whether a labeling is friendly can be done simply with a finite
computation. The point of studying friendly labelings is that consistent labelings "glue" together.
We will now make this statement explicit. Let $e$ be an interior edge of the tree $T$. Denote by $T_{e,-}$
the tree obtained from $T$ by taking the edge $e$ and all the edges below $e$. Denote by $T_{e,+}$ the tree
obtained from $T$ by taking the edge $e$ and all edges not in $T_{e,-}$. Then we have the following

Lemma 12. Let $\lambda^-$ and $\lambda^+$ be consistent labelings of $T_{e,-}$ and $T_{e,+}$ respectively, i.e. $\lambda^- \in \mathrm{im}(L^{T_{e,-}})$
and $\lambda^+ \in \mathrm{im}(L^{T_{e,+}})$. Suppose furthermore that $\lambda^-(e) = \lambda^+(e)$. Then the labeling $\lambda$ of $T$ obtained
from $\lambda^-$ and $\lambda^+$ by labeling edges of $T$ appropriately is consistent, i.e., $\lambda \in \mathrm{im}(L^T)$.

Proof. Since $\lambda^+$ and $\lambda^-$ are consistent, there is some assignment of group elements to the edges
of $T_{e,+}$ and $T_{e,-}$ that comes from $(L^{T_{e,+}})^{-1}(\mathrm{im}(L^{T_{e,+}}))$ and $(L^{T_{e,-}})^{-1}(\mathrm{im}(L^{T_{e,-}}))$. We will now
construct an assignment of group elements of the edges of $T$ that belongs to $(L^T)^{-1}(\mathrm{im}(L^T))$. First
take any assignment which is compatible with $\lambda^+$ on $T_{e,+}$. This assigns some group element to the
edge $e$. Let $v$ be the nonleaf vertex of $T_{e,-}$ incident to $e$. Since $L$ is friendly, and $\lambda^-$ is consistent,
there exists an assignment of group elements to all the other edges incident to $v$ which is compatible
with $\lambda^-(e)$ and is locally consistent. By induction on the number of interior vertices of $T_{e,-}$ we
construct a globally consistent assignment of group elements to the edges of $T$. □

Lemma 12 is the main technical result upon which all our combinatorial constructions of
generators and Gröbner bases rest. Indeed, as we will see, it implies that phylogenetic invariants of
group based models with friendly labelings are only determined by local features of the tree. We
conclude this section with some examples of friendly labeling functions.

Example 13. Let $G$ be any finite abelian group. Any function $L : G \to \mathcal{L}$ that is injective
is friendly for trivial reasons (the two sets in (17) are singletons and hence equal). For similar
reasons, if $\mathcal{L}$ consists of elements of a group and $L$ is a group homomorphism then $L$ is friendly.

Example 14 (The Jukes-Cantor labeling function). Let $\mathcal{L} = \{0, 1\}$ and $L$ the function

$$L(g) \;=\; \begin{cases} 0 & \text{if } g = 0, \\ 1 & \text{otherwise.} \end{cases}$$

Then $L$ is friendly for any group $G$. It corresponds to the Jukes-Cantor models when $G = \mathbb{Z}_2^n$.
Example 15. The Kimura 2-parameter labeling function of Example 10 is friendly by Lemma 11.

4 The Main Result 

We will now state and prove our main result concerning the ideal of phylogenetic invariants of any
group based model with friendly labeling function $L$. We consider the toric ideal $I_{T,L}$ which is the
kernel of the monomial map (16), and we construct minimal generators and a Gröbner basis for
$I_{T,L}$ out of purely local information in the tree. This Gröbner basis is a list of binomials in
the unknowns $q_l$ which are indexed by the consistent labelings $l \in \mathrm{im}(L^T)$. In order to transform
the binomials into polynomials in the probabilities $p_{g_1, \ldots, g_m}$, one must reverse the transformations
described in Sections 2 and 3. In Section 6, we will characterize the consistent labelings and examine
the relevant transformations for the four standard models of Theorem 2. Throughout this section,
we assume that $L : G \to \mathcal{L}$ is an arbitrary friendly labeling on a finite abelian group $G$.
For ease of notation, we write the monomials in the unknowns $q_l$ using the tableau notation.
This means that any monomial $M = q_{l_1} q_{l_2} \cdots q_{l_d}$ is written as a matrix of format $d \times |E(T)|$:

$$M \;=\; \begin{bmatrix} l_1 \\ l_2 \\ \vdots \\ l_d \end{bmatrix}.$$

Such a matrix with entries in $\mathcal{L}$ is called a tableau. The columns of a tableau are indexed by the
edges of the tree $T$ under consideration. The number of rows of $M$ is the degree $d$ of the monomial.
Two tableaux represent the same monomial if they are related by a permutation of rows.

Binomials in the unknowns $q_l$ are represented as formal differences $M - M'$ of tableaux.
Notice that it is easy to check whether a given binomial $M - M'$ lies in the toric ideal $I_{T,L}$.

Remark 16. Let $M$ and $M'$ be two tableaux of format $d \times |E(T)|$ with entries in $\mathcal{L}$. Then the
binomial $M - M'$ lies in the ideal $I_{T,L}$ if and only if the following two conditions hold:

(a) each row of $M$ and each row of $M'$ is a consistent labeling for the tree $T$, and

(b) for each edge $e \in E(T)$, the multiset of labels in column $e$ is the same in $M$ and in $M'$.

We are now ready to construct the binomials that will constitute the Gröbner bases of $I_{T,L}$.
Let $e$ be an interior edge of $T$, and let $T_{e,-}$ and $T_{e,+}$ be the two subtrees as in Lemma 12. After
relabeling the edges of $T$, every tableau $M$ can be written in three groups of columns,

$$M \;=\; \begin{bmatrix} l_1 & m_1 & n_1 \\ l_2 & m_2 & n_2 \\ \vdots & \vdots & \vdots \\ l_d & m_d & n_d \end{bmatrix},$$

where the left columns (with entries $l_i$) correspond to the edges in $T_{e,-} \setminus \{e\}$, the middle column
corresponds to the edge $e$, and the right columns correspond to the edges in $T_{e,+} \setminus \{e\}$.

Lemma 17. Let $(l_1, m, n_1)$ and $(l_2, m, n_2)$ be consistent labelings of $T$. Then the quadratic binomial

$$g \;=\; \begin{bmatrix} l_1 & m & n_1 \\ l_2 & m & n_2 \end{bmatrix} - \begin{bmatrix} l_1 & m & n_2 \\ l_2 & m & n_1 \end{bmatrix}$$

lies in the toric ideal $I_{T,L}$.

Proof. The labelings $(l_1, m)$ and $(l_2, m)$ are consistent for the subtree $T_{e,-}$, and the labelings
$(m, n_1)$ and $(m, n_2)$ are consistent for the subtree $T_{e,+}$. By Lemma 12 the labelings $(l_1, m, n_2)$
and $(l_2, m, n_1)$ are consistent for the big tree $T$. Remark 16 implies that $g \in I_{T,L}$. □

Definition 18. Denote by Quad($e$, $T$) the set of all the quadratic binomials $g$ from Lemma 17.
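The membership test of Remark 16 and the row swap of Lemma 17 are easy to experiment with. The Python sketch below fixes a hypothetical tree (leaves 1 and 2 below an interior node, leaf 3 attached at the root, plus the extra root edge), the group $\mathbb{Z}_2 \times \mathbb{Z}_2$ and the Jukes-Cantor labeling; tableaux are lists of label tuples, and the names `in_ideal` and `quad` are illustrative.

```python
import itertools
from collections import Counter

G = list(itertools.product(range(2), repeat=2))   # the group Z2 x Z2

def add(*elements):
    return (sum(g[0] for g in elements) % 2, sum(g[1] for g in elements) % 2)

def L(g):
    """Jukes-Cantor labeling function."""
    return 0 if g == (0, 0) else 1

# Column order of a tableau row: e1, e2 | e | e3, root edge.
def labeling(g1, g2, g3):
    return (L(g1), L(g2), L(add(g1, g2)), L(g3), L(add(g1, g2, g3)))

CONSISTENT = {labeling(*g) for g in itertools.product(G, repeat=3)}

def in_ideal(M, M_prime):
    """Membership test of Remark 16 for the binomial M - M'."""
    rows_ok = all(row in CONSISTENT for row in M + M_prime)
    cols_ok = all(Counter(col) == Counter(col_p)
                  for col, col_p in zip(zip(*M), zip(*M_prime)))
    return len(M) == len(M_prime) and rows_ok and cols_ok

def quad(row1, row2):
    """Swap the right-hand blocks of two rows that agree on the edge e (Lemma 17)."""
    assert row1[2] == row2[2]
    return [row1, row2], [row1[:3] + row2[3:], row2[:3] + row1[3:]]

row1 = labeling((0, 1), (1, 0), (0, 1))   # (1, 1, 1, 1, 1)
row2 = labeling((0, 1), (0, 0), (0, 0))   # (1, 0, 1, 0, 1)
M, M_prime = quad(row1, row2)
```

For these two rows, `in_ideal(M, M_prime)` returns `True`, whereas the degree-one binomial `in_ideal([row1], [row2])` fails condition (b) of Remark 16.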
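Consider now an arbitrary binomial in the ideal $I_{T,L}$. It has the form

$$h \;=\; \begin{bmatrix} l_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l_d & m_d & n_d \end{bmatrix} - \begin{bmatrix} l'_1 & m'_1 & n'_1 \\ \vdots & \vdots & \vdots \\ l'_d & m'_d & n'_d \end{bmatrix},$$

where the $m_i$ and $m'_i$ are single labels corresponding to the edge $e$, the $l_i$ and $l'_i$ are consistent
labelings of $T_{e,-} \setminus \{e\}$, and the $n_i$ and $n'_i$ are consistent labelings of $T_{e,+} \setminus \{e\}$. Note that since the
binomial $h$ belongs to $I_{T,L}$, the multiset of labels which appears on the edge $e$ must be the same
for both terms of $h$. Hence, after rearranging the rows of the tableau we may write

$$h \;=\; \begin{bmatrix} l_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l_d & m_d & n_d \end{bmatrix} - \begin{bmatrix} l'_1 & m_1 & n'_1 \\ \vdots & \vdots & \vdots \\ l'_d & m_d & n'_d \end{bmatrix}.$$

Every binomial in $I_{T,L}$ restricts to a binomial in $I_{T_{e,-},L}$ and to a binomial in $I_{T_{e,+},L}$. Namely,
if $h$ is the binomial above, then the following binomial lies in $I_{T_{e,-},L}$:

$$h|_{T_{e,-}} \;=\; \begin{bmatrix} l_1 & m_1 \\ \vdots & \vdots \\ l_d & m_d \end{bmatrix} - \begin{bmatrix} l'_1 & m_1 \\ \vdots & \vdots \\ l'_d & m_d \end{bmatrix}.$$

Similarly, deleting the left columns yields a binomial $h|_{T_{e,+}}$ in $I_{T_{e,+},L}$. We now state a constructive
converse, from which binomials in $I_{T_{e,-},L}$ and $I_{T_{e,+},L}$ can be extended to binomials in $I_{T,L}$.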

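Lemma 19. Let $g$ be a binomial in $I_{T_{e,-},L}$ written in tableau notation as

$$g \;=\; \begin{bmatrix} l_1 & m_1 \\ \vdots & \vdots \\ l_d & m_d \end{bmatrix} - \begin{bmatrix} l'_1 & m_1 \\ \vdots & \vdots \\ l'_d & m_d \end{bmatrix}.$$

Let $n_1, \ldots, n_d$ be sequences of labels such that each $(m_i, n_i)$ is a consistent labeling of $T_{e,+}$. Then

$$g^* \;=\; \begin{bmatrix} l_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l_d & m_d & n_d \end{bmatrix} - \begin{bmatrix} l'_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l'_d & m_d & n_d \end{bmatrix}$$

is a binomial in $I_{T,L}$.

Proof. Restricting the two tableaux to the trees $T_{e,-}$ and $T_{e,+}$ shows that the multisets of labels
which appear on each edge are the same. In fact, we have

$$g^*|_{T_{e,-}} = g \qquad \text{and} \qquad g^*|_{T_{e,+}} = 0.$$

We must check that each of $(l_i, m_i, n_i)$ and $(l'_i, m_i, n_i)$ is a consistent labeling on $T$. Lemma 12
implies this because $(l_i, m_i)$ and $(l'_i, m_i)$ are consistent on $T_{e,-}$ and $(m_i, n_i)$ is consistent on $T_{e,+}$. □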
Definition 20. Let $B$ be a collection of binomials in $I_{T_{e,-},L}$. We define $\mathrm{Ext}(B \to T)$ to be the
set of all binomials $g^*$ where $g$ ranges over $B$ and $n_1, \ldots, n_d$ ranges over sequences of labels as in
Lemma 19. Similarly, we define $\mathrm{Ext}(T \leftarrow B)$ for any collection of binomials $B$ in $I_{T_{e,+},L}$.

The first main result of this paper is the following theorem.

Theorem 21. Let $T$ be any tree with a friendly labeling $L : G \to \mathcal{L}$ and let $e$ be any interior edge
of $T$. Suppose that $B_-$ is a binomial generating set for $I_{T_{e,-},L}$ and let $B_+$ be a binomial generating
set for $I_{T_{e,+},L}$. Then the following set of binomials generates the toric ideal $I_{T,L}$:

$$\mathrm{Ext}(B_- \to T) \;\cup\; \mathrm{Ext}(T \leftarrow B_+) \;\cup\; \mathrm{Quad}(e, T). \tag{18}$$

Moreover, if $B_-$ is a Gröbner basis for $I_{T_{e,-},L}$ and $B_+$ is a Gröbner basis for $I_{T_{e,+},L}$, then there
exists a term order on $\mathbb{C}[q_\lambda : \lambda \in \mathrm{im}(L^T)]$ such that the set in (18) is a Gröbner basis for $I_{T,L}$.

Proof. We first prove the second statement concerning Gröbner bases. To this end we need to specify the term orders. Let $\prec_-$ be any term order on $\mathbb{C}[q_\lambda : \lambda \in \mathrm{im}(L^{T_{e,-}})]$ such that $\mathcal{B}_-$ is a Gröbner basis for $I_{T_{e,-},L}$ and $\prec_+$ any term order on $\mathbb{C}[q_\lambda : \lambda \in \mathrm{im}(L^{T_{e,+}})]$ such that $\mathcal{B}_+$ is a Gröbner basis for $I_{T_{e,+},L}$. Finally, let us define a reverse lexicographic term order $\prec_Q$ which makes $\mathrm{Quad}(e,T)$ a Gröbner basis for the ideal it generates. We do this by first taking any total order $\prec_1$ on the labels of the edge $e$, then taking total orders $\prec_2$ on $\mathrm{im}(L^{T_{e,-}})$ and $\prec_3$ on $\mathrm{im}(L^{T_{e,+}})$ which are refinements of $\prec_1$. The revlex term order $\prec_Q$ is obtained by declaring $q_{\lambda_1} \prec_Q q_{\lambda_2}$ only if

$$ \lambda_1^- \prec_2 \lambda_2^- \quad\text{or}\quad \bigl(\lambda_1^- = \lambda_2^- \ \text{and}\ \lambda_1^+ \prec_3 \lambda_2^+\bigr). $$



We construct a product term order $\prec_T$ on the polynomial ring $\mathbb{C}[q_\lambda : \lambda \in \mathrm{im}(L^T)]$ as follows. If $M$ and $M'$ are monomials (tableaux with columns indexed by $E(T)$) then $M \prec_T M'$ if and only if

1. $M|_{T_{e,-}} \prec_- M'|_{T_{e,-}}$, or

2. $M|_{T_{e,-}} = M'|_{T_{e,-}}$ and $M|_{T_{e,+}} \prec_+ M'|_{T_{e,+}}$, or

3. $M|_{T_{e,-}} = M'|_{T_{e,-}}$ and $M|_{T_{e,+}} = M'|_{T_{e,+}}$ and $M \prec_Q M'$.

Our goal is to show that the set (18) is a Gröbner basis for $I_{T,L}$ with respect to the term order $\prec_T$, i.e., the leading term of every binomial $g$ in $I_{T,L}$ is divisible by the leading term of some binomial from (18). To prove this, we consider an arbitrary binomial in our toric ideal:

$$ g \;=\; M' - M \;=\; \begin{bmatrix} l_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l_d & m_d & n_d \end{bmatrix} \;-\; \begin{bmatrix} l'_1 & m'_1 & n'_1 \\ \vdots & \vdots & \vdots \\ l'_d & m'_d & n'_d \end{bmatrix} \;\in\; I_{T,L}. $$

Suppose that $M'$ is the leading term of $g$. There are precisely three different ways this can happen, according to the three cases in the definition of $\prec_T$. Each case will be analyzed separately.
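Case 1: Suppose that $M|_{T_{e,-}} \prec_- M'|_{T_{e,-}}$. Then $g|_{T_{e,-}}$ is a nonzero binomial in $I_{T_{e,-},L}$ and $M'|_{T_{e,-}}$ is its leading term. Since $\mathcal{B}_-$ is a Gröbner basis there exists a binomial $h = N' - N \in \mathcal{B}_-$ whose leading term $N'$ divides $M'|_{T_{e,-}}$. Upon reordering the rows of $M'$ and $M$, we may suppose that

$$ h \;=\; N' - N \;=\; \begin{bmatrix} l_1 & m_1 \\ \vdots & \vdots \\ l_i & m_i \end{bmatrix} \;-\; \begin{bmatrix} l'_1 & m_1 \\ \vdots & \vdots \\ l'_i & m_i \end{bmatrix} \qquad \text{for some } i \le d. $$

Here $l_1, \ldots, l_i$ and $m_1, \ldots, m_i$ are the same labels that appear in $M'$. Now we consider the binomial $h^* \in \mathrm{Ext}(\mathcal{B}_- \to T)$ obtained by appending the labels $n_1, \ldots, n_i$:

$$ h^* \;=\; (N')^* - N^* \;=\; \begin{bmatrix} l_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l_i & m_i & n_i \end{bmatrix} \;-\; \begin{bmatrix} l'_1 & m_1 & n_1 \\ \vdots & \vdots & \vdots \\ l'_i & m_i & n_i \end{bmatrix}. $$

The tableau $(N')^*$ is the leading term of $h^*$ with respect to $\prec_T$, and $(N')^*$ divides $M'$ as desired.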

Case 2: Suppose $M|_{T_{e,-}} = M'|_{T_{e,-}}$ and $M|_{T_{e,+}} \prec_+ M'|_{T_{e,+}}$. Then by the same argument as in Case 1, we deduce that there is a binomial $h^* \in \mathrm{Ext}(T \leftarrow \mathcal{B}_+)$ whose leading term divides $M'$.

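Case 3: Suppose that $M|_{T_{e,-}} = M'|_{T_{e,-}}$ and $M|_{T_{e,+}} = M'|_{T_{e,+}}$ and $M \prec_Q M'$. The only way that this could happen is if there exists a pair of rows in $M'$, $(l_1, m, n_1)$ and $(l_2, m, n_2)$, such that $(l_1, m) \prec_2 (l_2, m)$ and $(m, n_1) \succ_3 (m, n_2)$. But then the binomial $h \in \mathrm{Quad}(e,T)$ given by

$$ h \;=\; N' - N \;=\; \begin{bmatrix} l_1 & m & n_1 \\ l_2 & m & n_2 \end{bmatrix} \;-\; \begin{bmatrix} l_1 & m & n_2 \\ l_2 & m & n_1 \end{bmatrix} $$

has leading term $N'$, and this leading term divides the leading term $M'$ of the binomial $g$.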

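These three cases together establish the second statement: the set (18) is a Gröbner basis for $I_{T,L}$. Furthermore, for any Gröbner bases $\mathcal{B}_-$ and $\mathcal{B}_+$ we have the equality of ideals

$$ I_{T,L} \;=\; \langle \mathrm{Ext}(\mathcal{B}_- \to T) \rangle + \langle \mathrm{Ext}(T \leftarrow \mathcal{B}_+) \rangle + \langle \mathrm{Quad}(e,T) \rangle. $$

In this equation, we may replace $\mathrm{Ext}(\mathcal{B}_- \to T)$ with any set that generates $\langle \mathrm{Ext}(\mathcal{B}_- \to T) \rangle$. But $\mathrm{Ext}(\mathcal{C}_- \to T)$ generates $\langle \mathrm{Ext}(\mathcal{B}_- \to T) \rangle$ whenever $\mathcal{C}_-$ is a generating set for $I_{T_{e,-},L}$. A similar statement holds for $I_{T_{e,+},L}$. This completes the proof of the first statement in Theorem 21. □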

Theorem 1 in the Introduction says that all invariants are determined by local features of the tree. We shall now state this result more precisely and derive it as a corollary from Theorem 21.

Let $v$ be an interior vertex of the tree $T$, and let $e_1, \ldots, e_c$ be the edges of $T$ incident to $v$. Denote by $T_{v,e_i}$ the subtree $T_{e_i,-}$ or $T_{e_i,+}$ which has $v$ as a leaf. Given a particular label $l$ for the edge $e_i$, denote by $\mathrm{im}(L^{T_{v,e_i}}, l)$ the set of all consistent labelings of $T_{v,e_i}$ which have the label of $e_i$ equal to $l$. Denote by $T_v$ the subtree of $T$ with only interior node $v$ and edges $e_1, \ldots, e_c$. Note that $T_v$ is the claw tree $K_{1,c}$. It has no interior edges. These definitions are illustrated in Figure 1.

Figure 1: The subtrees around a vertex v of the tree T



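Lemma 22. Let $g$ be any binomial in the ideal of the claw tree $T_v$, written in tableau notation as

$$ g \;=\; \begin{bmatrix} l_1^1 & \cdots & l_1^c \\ \vdots & & \vdots \\ l_d^1 & \cdots & l_d^c \end{bmatrix} \;-\; \begin{bmatrix} m_1^1 & \cdots & m_1^c \\ \vdots & & \vdots \\ m_d^1 & \cdots & m_d^c \end{bmatrix}. $$

For each row $i$ and column $j$, consider labelings $L_i^j \in \mathrm{im}(L^{T_{v,e_j}}, l_i^j)$ and $M_i^j \in \mathrm{im}(L^{T_{v,e_j}}, m_i^j)$ with the property that, for each $j$, the multiset $\{L_i^j\}_{i=1}^d$ is equal to the multiset $\{M_i^j\}_{i=1}^d$. Then the binomial

$$ g^* \;=\; \begin{bmatrix} L_1^1 & \cdots & L_1^c \\ \vdots & & \vdots \\ L_d^1 & \cdots & L_d^c \end{bmatrix} \;-\; \begin{bmatrix} M_1^1 & \cdots & M_1^c \\ \vdots & & \vdots \\ M_d^1 & \cdots & M_d^c \end{bmatrix} $$

belongs to the toric ideal $I_{T,L}$ of the big tree $T$.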



Proof. Since $L$ is friendly, each row in the tableaux is a consistent labeling. Restricting to each subtree yields the same multiset of labels. Hence the binomial $g^*$ is in $I_{T,L}$. □

Definition 23. Let $\mathcal{B}$ be any set of binomials in the (claw tree) ideal $I_{T_v,L}$. Denote by $\mathrm{Ext}(\mathcal{B} \to T)$ the set of all binomials $g^*$ obtained by applying the construction in Lemma 22 to the binomials $g \in \mathcal{B}$.

Theorem 24 (Local Structure of Invariants). Let $T$ be a tree with a friendly labeling $L : G \to \mathcal{L}$. For each interior vertex $v$ of the tree $T$, let $\mathcal{B}_v$ denote a binomial generating set for $I_{T_v,L}$. Then the following set of binomials generates the ideal $I_{T,L}$ of all phylogenetic invariants of $T$:

$$ \bigcup_{v} \mathrm{Ext}(\mathcal{B}_v \to T) \;\cup\; \bigcup_{e} \mathrm{Quad}(e,T). \tag{19} $$

The first union is over the interior vertices of $T$. The second union is over the interior edges of $T$.

Proof. We proceed by induction on the number of interior vertices of $T$. If there is only one interior vertex then the statement is a tautology. Suppose there are $m \ge 2$ interior vertices. There exists an interior vertex $v$ which is incident to only one other interior vertex $u$. Let $e$ be the edge connecting $v$ and $u$. The tree $T_{e,-}$ has $m - 1$ interior vertices and the tree $T_{e,+}$ has only one interior vertex. By induction, the corresponding ideals have generating sets that come in the form of (19). Applying Theorem 21 yields a generating set for $I_{T,L}$ which is larger than the set of binomials listed in (19). We claim that every binomial in the set difference (18) \ (19) belongs to the ideal generated by (19). Indeed, each such binomial differs from a binomial in (19) by swapping some of the labels in
the columns corresponding to the tree T^^e- Such a swap can occur only when the edge label at e 
itself is the same for each row of the tableau involved in the swap. But such a swap (or sequence 
of such swaps) can be realized by adding multiples of the quadratic binomials in Quad(e,T). □ 



5 The Toric Algebra of Group Multiplication 

The results of the previous section reduce the computation of our toric ideals of phylogenetic invariants to the local case, namely, when the tree has only one interior node. Such a tree is a claw tree $K_{1,n}$. The corresponding toric ideal $I_{G,n}$ depends only on two parameters: a finite (additive) abelian group $G$ and a positive integer $n$. This construction furnishes a new family of numerical invariants for any group $G$, and it may hence be of independent interest to algebraists.

Throughout this section we assume that the labeling function $L$ is the identity map on a finite group $G$. Our object of interest is the following monomial map between polynomial rings:

$$ \mathbb{C}[q_{g_1,\ldots,g_n} : g_1,\ldots,g_n \in G] \;\to\; \mathbb{C}[a^{(i)}_g : g \in G,\ i = 1,\ldots,n+1], $$

$$ q_{g_1,\ldots,g_n} \;\mapsto\; a^{(1)}_{g_1} a^{(2)}_{g_2} \cdots a^{(n)}_{g_n}\, a^{(n+1)}_{g_1+g_2+\cdots+g_n}. $$

Let $I_{G,n}$ denote the kernel of this ring homomorphism. This is the ideal of phylogenetic invariants in the Fourier coordinates for the claw tree $K_{1,n}$. Note that the definition of the toric ideal $I_{G,n}$ makes sense for any group $G$, even if $G$ is not abelian. It encodes the group multiplication table. The following example is the basic building block for the Kimura 3-parameter model on a binary tree.

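Example 25. Let $n = 2$ and $G = \mathbb{Z}_2 \times \mathbb{Z}_2$. We identify the group elements with the nucleotides:

$$ A = (0,0), \quad G = (0,1), \quad C = (1,0), \quad T = (1,1). $$

Then $I_{\mathbb{Z}_2 \times \mathbb{Z}_2, 2}$ is an ideal in $\mathbb{C}[q_{AA}, q_{AG}, q_{AC}, q_{AT}, q_{GA}, q_{GG}, q_{GC}, q_{GT}, q_{CA}, q_{CG}, q_{CC}, q_{CT}, q_{TA}, q_{TG}, q_{TC}, q_{TT}]$. It is the kernel of the monomial map $q_{g_1 g_2} \mapsto x_{g_1} y_{g_2} z_{g_1+g_2}$. More specifically,

$$ q_{AA} \mapsto x_A y_A z_A, \ \ldots, \ q_{AT} \mapsto x_A y_T z_T, \ \ldots, \ q_{GC} \mapsto x_G y_C z_T, \ \ldots, \ q_{TT} \mapsto x_T y_T z_A. $$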






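The toric ideal $I_{\mathbb{Z}_2 \times \mathbb{Z}_2, 2}$ is minimally generated by the 16 cubics

$q_{AA}q_{CT}q_{TG} - q_{AG}q_{CA}q_{TT}$, $\quad q_{AA}q_{GT}q_{TC} - q_{AC}q_{GA}q_{TT}$, $\quad q_{AC}q_{CT}q_{TA} - q_{AT}q_{CA}q_{TC}$,

$q_{AC}q_{GG}q_{TA} - q_{AA}q_{GC}q_{TG}$, $\quad q_{AG}q_{CC}q_{TA} - q_{AA}q_{CG}q_{TC}$, $\quad q_{AG}q_{GC}q_{CA} - q_{AC}q_{GA}q_{CG}$,

$q_{AG}q_{GT}q_{CC} - q_{AC}q_{GG}q_{CT}$, $\quad q_{AG}q_{GT}q_{TA} - q_{AT}q_{GA}q_{TG}$, $\quad q_{AT}q_{CC}q_{TG} - q_{AC}q_{CG}q_{TT}$,

$q_{AT}q_{GA}q_{CC} - q_{AA}q_{GC}q_{CT}$, $\quad q_{AT}q_{GG}q_{CA} - q_{AA}q_{GT}q_{CG}$, $\quad q_{AT}q_{GG}q_{TC} - q_{AG}q_{GC}q_{TT}$,

$q_{GA}q_{CC}q_{TG} - q_{GG}q_{CA}q_{TC}$, $\quad q_{GC}q_{CT}q_{TG} - q_{GT}q_{CG}q_{TC}$, $\quad q_{GG}q_{CT}q_{TA} - q_{GA}q_{CG}q_{TT}$,

$q_{GT}q_{CC}q_{TA} - q_{GC}q_{CA}q_{TT}$

and the 18 quartics

$q_{AA}q_{AT}q_{TG}q_{TC} - q_{AG}q_{AC}q_{TA}q_{TT}$, $\quad q_{AA}q_{GG}q_{CT}q_{TC} - q_{AG}q_{GA}q_{CC}q_{TT}$,

$q_{AA}q_{GT}q_{CC}q_{TG} - q_{AC}q_{GG}q_{CA}q_{TT}$, $\quad q_{AA}q_{GT}q_{CT}q_{TA} - q_{AT}q_{GA}q_{CA}q_{TT}$,

$q_{AC}q_{AT}q_{GA}q_{GG} - q_{AA}q_{AG}q_{GC}q_{GT}$, $\quad q_{AC}q_{GA}q_{CC}q_{TA} - q_{AA}q_{GC}q_{CA}q_{TC}$,

$q_{AG}q_{CA}q_{GT}q_{TC} - q_{AC}q_{CT}q_{GA}q_{TG}$, $\quad q_{AC}q_{GT}q_{CG}q_{TA} - q_{AT}q_{GC}q_{CA}q_{TG}$,

$q_{AG}q_{AT}q_{CA}q_{CC} - q_{AA}q_{AC}q_{CG}q_{CT}$, $\quad q_{AG}q_{GC}q_{CC}q_{TG} - q_{AC}q_{GG}q_{CG}q_{TC}$,

$q_{AG}q_{GC}q_{CT}q_{TA} - q_{AT}q_{GA}q_{CG}q_{TC}$, $\quad q_{AG}q_{GG}q_{CA}q_{TA} - q_{AA}q_{GA}q_{CG}q_{TG}$,

$q_{AT}q_{GG}q_{CC}q_{TA} - q_{AA}q_{CG}q_{GC}q_{TT}$, $\quad q_{AT}q_{GG}q_{CT}q_{TG} - q_{AG}q_{GT}q_{CG}q_{TT}$,

$q_{AT}q_{GT}q_{CC}q_{TC} - q_{AC}q_{GC}q_{CT}q_{TT}$, $\quad q_{CC}q_{CT}q_{TA}q_{TG} - q_{CA}q_{CG}q_{TC}q_{TT}$,

$q_{GA}q_{GT}q_{CG}q_{CC} - q_{GG}q_{GC}q_{CA}q_{CT}$, $\quad q_{GG}q_{GT}q_{TA}q_{TC} - q_{GA}q_{GC}q_{TG}q_{TT}$.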

Geometrically, these binomials define a 11-dimensional toric variety of degree 96 in □ 
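Membership of a binomial in this toric ideal can be checked mechanically: both monomials must have the same image under $q_{g_1 g_2} \mapsto x_{g_1} y_{g_2} z_{g_1+g_2}$. The following sketch does exactly that. The function names are hypothetical, and the identification of nucleotides with group elements is an assumption of the sketch (A as the identity, G = (0,1), C = (1,0), T = (1,1)); any assignment with A as the identity gives the same answers, since the sum of two distinct non-identity elements is always the third.

```python
from collections import Counter

# Assumed identification of nucleotides with elements of Z2 x Z2.
ELEMENTS = {"A": (0, 0), "G": (0, 1), "C": (1, 0), "T": (1, 1)}
NAMES = {value: name for name, value in ELEMENTS.items()}

def add(g, h):
    """Group addition in Z2 x Z2, on nucleotide names."""
    a, b = ELEMENTS[g], ELEMENTS[h]
    return NAMES[((a[0] + b[0]) % 2, (a[1] + b[1]) % 2)]

def image(monomial):
    """Exponent data of the image of a monomial such as ['AA', 'CT', 'TG'].

    Returns the multisets of x-, y- and z-indices that occur.
    """
    x = Counter(v[0] for v in monomial)
    y = Counter(v[1] for v in monomial)
    z = Counter(add(v[0], v[1]) for v in monomial)
    return x, y, z

def in_ideal(lhs, rhs):
    """True if the binomial lhs - rhs lies in the kernel of the monomial map."""
    return image(lhs) == image(rhs)

cubic_ok = in_ideal(["AA", "CT", "TG"], ["AG", "CA", "TT"])
quartic_ok = in_ideal(["AA", "AT", "TG", "TC"], ["AG", "AC", "TA", "TT"])
```

A check of this kind only confirms that a binomial vanishes on the model; it says nothing about whether a given list of binomials is a minimal generating set.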

Let $\phi(G,n)$ denote the largest degree of any minimal generator of the toric ideal $I_{G,n}$. We computed these numbers for some small groups $G$ and small values of $n$ using the toric algebra software 4ti2 written by the Hemmeckes. The results are displayed in the following table:



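| G | n | 2 | 3 | 4 | 5 | 6 | $\phi(G,n)$ |
|---|---|---|---|---|---|---|---|
| $\mathbb{Z}_2$ | 2 | | | | | | |
| $\mathbb{Z}_2$ | 3 | 3 | | | | | 2 |
| $\mathbb{Z}_2$ | 4 | 30 | | | | | 2 |
| $\mathbb{Z}_2$ | 5 | 195 | | | | | 2 |
| $\mathbb{Z}_2$ | 6 | 1050 | | | | | 2 |
| $\mathbb{Z}_3$ | 2 | | 2 | | | | 3 |
| $\mathbb{Z}_3$ | 3 | 54 | 24 | | | | 3 |
| $\mathbb{Z}_4$ | 2 | | 16 | 6 | | | 4 |
| $\mathbb{Z}_4$ | 3 | 344 | 256 | 96 | | | 4 |
| $\mathbb{Z}_2 \times \mathbb{Z}_2$ | 2 | | 16 | 18 | | | 4 |
| $\mathbb{Z}_2 \times \mathbb{Z}_2$ | 3 | 360 | 261 | 480 | | | 4 |
| $\mathbb{Z}_5$ | 2 | | 50 | 50 | | | 4 |
| $\mathbb{Z}_6$ | 2 | | 116 | 675 | 216 | 126 | 6 |
| $\mathbb{Z}_7$ | 2 | | 245 | 1764 | 1764 | 294 | 6 |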




The entry in the row labeled $(G, n)$ and column labeled $i$ is the number of minimal generators of $I_{G,n}$ having degree $i$. For the two element group $\mathbb{Z}_2$ we can prove the following general result:

Theorem 26. The toric ideal $I_{\mathbb{Z}_2,n}$ is generated in degree two. In symbols, $\phi(\mathbb{Z}_2, n) = 2$ for $n \ge 3$.

Proof. Following the discussion in the previous section, the monomials in the polynomial ring $\mathbb{C}[q_{g_1,\ldots,g_n} : g_i \in \{0,1\}]$ are identified with tableaux. An $m \times (n+1)$-tableau $T$ with entries in $\{0,1\}$ represents a monomial if and only if all row sums of $T$ are even. The ideal $I_{\mathbb{Z}_2,n}$ is spanned by all binomials $T - T'$ where $T$ and $T'$ are such tableaux which have the same column sums.

Consider any binomial $T - T'$ in the toric ideal $I_{\mathbb{Z}_2,n}$. We can pick any two columns $i$ and $j$ and switch each 0 in these two columns to a 1 and vice versa. The resulting tableaux still have even row sums and their difference is in $I_{\mathbb{Z}_2,n}$. We will use this symmetry in the next paragraph.

Suppose that $I_{\mathbb{Z}_2,n}$ is not generated by quadrics. Then the ideal contains a binomial $T - T'$ of degree $m \ge 3$ such that $T$ and $T'$ cannot be connected by moves involving only two rows at a time. Such a move corresponds to adding a multiple of a quadratic binomial. We may suppose that $m$ is the smallest degree of any such binomial. After permuting columns and applying the symmetry described above, we may assume that

$$ T - T' \;=\; \begin{bmatrix} 0 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & & & & \vdots \end{bmatrix} \;-\; \begin{bmatrix} 0 & \cdots & 0 & 1 & \cdots & 1 \\ \vdots & & & & & \vdots \end{bmatrix}. $$

We may further assume that the number $k$ of 1's in the first row of $T'$ is less than or equal to the number of disagreements between $T$ and $T'$ in any other row. The pair $(m, k)$ is thus assumed to be lexicographically minimal among all such counterexample binomials.

Consider the two rightmost columns. If there exists a pair 00 in these columns in tableau $T'$ then we can swap the pair 00 with the pair 11 in the first row and get a counterexample with smaller value of $m$. Likewise, if there exists a pair 11 in these columns in tableau $T$ then we can swap the pair 11 with the pair 00 in the first row and get a counterexample with smaller value of $m$. We thus conclude the sum of the two last columns in $T'$ is at least $m + 1$. Likewise, the sum of the two last columns in $T$ is at most $m - 1$. This is a contradiction to the hypothesis that $T$ and $T'$ have the same column sums. This completes the proof that $I_{\mathbb{Z}_2,n}$ is generated by quadrics. □
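The tableau description in the proof also gives a quick way to count quadrics for $\mathbb{Z}_2$. The sketch below is a brute-force illustration, not the 4ti2 computation behind the table, and the function name `count_quadrics` is hypothetical. It groups degree-two monomials (pairs of even rows) by their column sums; a group of $k$ monomials contributes $k - 1$ linearly independent quadratic binomials. Since the ideal contains no linear forms, this number is also the number of minimal generators of degree two.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

def count_quadrics(n):
    """Dimension of the degree-two part of the toric ideal for G = Z2.

    Variables are rows of length n + 1 over {0, 1} with even row sum.
    Two degree-two monomials differ by an element of the ideal exactly
    when their tableaux have the same column sums.
    """
    rows = [r for r in product((0, 1), repeat=n + 1) if sum(r) % 2 == 0]
    fibers = Counter()
    for u, v in combinations_with_replacement(rows, 2):
        column_sums = tuple(a + b for a, b in zip(u, v))
        fibers[column_sums] += 1
    # A fiber containing k monomials yields k - 1 independent binomials.
    return sum(k - 1 for k in fibers.values())

quadric_counts = {n: count_quadrics(n) for n in range(2, 7)}
```

Under these assumptions the counts for $n = 2, \ldots, 6$ come out as 0, 3, 30, 195 and 1050.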

Theorem 26 and our computational results suggest the following general conjecture.

Conjecture 27. For any finite abelian group $G$ and any positive integer $n$ we have $\phi(G,n) \le |G|$.

If this conjecture holds then it is natural to define the phylogenetic complexity of a group $G$ as

$$ \phi(G) \;:=\; \max_{n \ge 2} \phi(G,n). $$

The phylogenetic complexity $\phi(G)$ is an intrinsic invariant of the group $G$. It makes perfect sense for arbitrary groups, not just abelian groups. However, if $G$ is not abelian then the phylogenetic complexity can exceed the group order. Using the software 4ti2, we found that $\phi(S_3, 2) \ge 8$ for the symmetric group on three letters. It would be interesting to study the group-theoretic meaning of this invariant. For applications in computational biology, however, it is the four-element group of Example 25 which deserves the most interest. We state this as a separate conjecture.

Conjecture 28. The phylogenetic complexity $\phi(G)$ of the group $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ is four.
6 Evolutionary Models for DNA Sequences



Theorem |^ follows as a corollary from Theorem [^l] and the computational results for n = 2 in 
the table of Section 5. In this section we make this explicit by deriving the quadratic and cubic 
generators of the ideal of phylogenetic invariants for the Jukes-Cantor models. The analogous 
derivation for the Kimura models will be sketched. Our discussion is aimed at computational 
biologists who wish to work with phylogenetic invariants for evolution of DNA sequences. 

6.1 Specifying the root distribution 

The theory developed so far was based on the unrealistic assumption that the structure of the root 
distribution is constrained by the group structure associated to the transition matrices. In practice, 
the root distribution will be either the uniform distribution or an arbitrary distribution. In the 
first case there are no parameters associated to the root and in the second case there are |G| — 1 
parameters associated to the root. In either case, the setup differs slightly from that of Sections 3 
and 4. In this subsection we explain why all the results including Theorem 21 still apply.

First we will suppose that, in our model, the root distribution $\pi$ is the uniform distribution. Let
Cr be the corresponding root edge. By Lemma ^ the Fourier transform of the uniform distribution 
$\pi$ is the function that is equal to one when evaluated at the identity and zero otherwise. This means that any Fourier coordinate $q_{g_1 \ldots g_m}$ with $g_1 + \cdots + g_m \neq 0$ is an invariant.

Proposition 29 (More Linear Invariants). Fix $\pi$ to be the uniform distribution. Then the ideal $I_{T,L}$ consists of the previous invariants together with all linear invariants $q_{g_1 \ldots g_m}$ with $g(e_r) \neq 0$.

Proof. All of the theory we have developed for friendly labelings still applies in this setting. The only change is to restrict the set of labels to the subset $\mathrm{im}(L^T, 0)$. This is the subset of those labels $\lambda \in \mathrm{im}(L^T)$ which satisfy $\lambda(e_r) = 0$. Notice that Theorem 21 still applies since the Ext operator is well-defined on sets of labels that are globally restricted on one or more edges. □

Now consider the case where $\pi$ is allowed to be arbitrary in the model under consideration. In this case we are not restricting the type of labels $\lambda$ which may appear, but we are in fact increasing the number and type of such labels. The labeling function $L$ is no longer the same on each edge of the graph: it is equal to the identity function on the edge corresponding to the root distribution. Such a mixed labeling function need not be friendly everywhere. However, it is still friendly around any vertex that is not incident to the root edge. More generally, if we consider any edge $e$ of the tree such that the mixed labeling function $L$ is friendly on the tree $T_{e,-}$ and possibly unfriendly
on the tree Tg^^, Theorem |^ still applies since the binomials constructed by the Ext operator are 
valid binomials. The crucial result which guaranteed that these polynomials actually contained 
unknowns which belonged to the ring was Lemma IT^ Upon inspection of its proof, however, we see 
that this only depended on $L$ being a labeling function that was friendly on half of the tree: $T_{e,-}$.

In summary, we can apply all of our constructive results in any of the cases of biological interest, 
regardless of whether or not the root distribution is uniform or arbitrary. 
6.2 Jukes-Cantor binary model



Let $T$ be a binary tree with $m$ leaves. The Jukes-Cantor binary model has transition matrices

$$ \begin{pmatrix} b_v & a_v \\ a_v & b_v \end{pmatrix}. $$

Here it is not necessary to require $a_v + b_v = 1$. We can regard $(a_v : b_v)$ as homogeneous coordinates.

We shall derive the invariants for this model in the Fourier coordinates. First assume that the root distribution is arbitrary. There are no linear invariants for this model. Add an extra edge at the root to arrive at a new tree $T'$ with $m + 1$ leaves. According to Theorem 24 we need to know the invariants from the tree $K_{1,3}$ at a vertex of $T'$ to determine a generating set for the ideal of all invariants associated to this model. However, a direct calculation shows that there are no invariants associated to $K_{1,3}$ (this is the first line of the table in Section 5). So we only need to consider the quadratic invariants associated to each edge of the tree. We now construct these explicitly.

The Fourier coordinates are $q_{g_1 \ldots g_{m+1}}$ where $g_i \in \mathbb{Z}_2$ and $g_1 + \cdots + g_{m+1} = 0$. These coordinates can be identified with families of disjoint paths connecting leaves of $T'$. Consider any interior edge $e$ of $T'$. We relabel the leaves so that the split determined by the edge $e$ separates the leaves $1, 2, \ldots, j$ from the leaves $j+1, \ldots, m+1$. We construct two matrices $M_0$ and $M_1$ each having $2^{j-1}$ rows and $2^{m-j}$ columns. The rows of $M_i$ are indexed by the sequences $(g_1, \ldots, g_j)$ such that $g_1 + \cdots + g_j = i$ and the columns are indexed by the sequences $(g_{j+1}, \ldots, g_{m+1})$ such that $g_{j+1} + \cdots + g_{m+1} = i$. The entry of $M_i$ in row $(g_1, \ldots, g_j)$ and column $(g_{j+1}, \ldots, g_{m+1})$ is the indeterminate $q_{g_1 \ldots g_{m+1}}$. The set $\mathrm{Quad}(e,T')$ is precisely the set of all $2 \times 2$ minors of the matrices $M_0$ and $M_1$. Our generating set for the ideal of invariants is the union of the sets $\mathrm{Quad}(e,T')$ as $e$ ranges over the interior edges. For the case of the uniform root distribution, we add the invariants $q_{g_1 \ldots g_{m+1}}$ satisfying $g_{m+1} = 1$.
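The construction of $M_0$ and $M_1$ is easy to mechanize. The following sketch builds the index sets of both matrices for a split $\{1, \ldots, j\} \mid \{j+1, \ldots, m+1\}$ and lists their $2 \times 2$ minors. Matrix entries are represented by their index tuples $(g_1, \ldots, g_{m+1})$ rather than by polynomial variables; the function names and this representation are choices of the sketch, not notation from the text.

```python
from itertools import combinations, product

def parity_sequences(length, parity):
    """All 0/1 sequences of the given length whose sum has the given parity."""
    return [s for s in product((0, 1), repeat=length) if sum(s) % 2 == parity]

def edge_matrices(j, m):
    """Index tuples of M_0 and M_1 for the split {1..j} | {j+1..m+1}."""
    matrices = []
    for i in (0, 1):
        rows = parity_sequences(j, i)
        cols = parity_sequences(m + 1 - j, i)
        matrices.append([[r + c for c in cols] for r in rows])
    return matrices

def two_by_two_minors(matrix):
    """Each minor as ((a, d), (b, c)), standing for q_a q_d - q_b q_c."""
    minors = []
    for r1, r2 in combinations(range(len(matrix)), 2):
        for c1, c2 in combinations(range(len(matrix[0])), 2):
            minors.append(((matrix[r1][c1], matrix[r2][c2]),
                           (matrix[r1][c2], matrix[r2][c1])))
    return minors

# m = 5 leaves before adding the root edge, and an edge cutting off 2 leaves.
M0, M1 = edge_matrices(2, 5)
number_of_minors = len(two_by_two_minors(M0)) + len(two_by_two_minors(M1))
```

For $m = 5$ and $j = 2$ both matrices have format $2 \times 8$, so each contributes $\binom{8}{2} = 28$ minors and the edge contributes 56 quadrics in total.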

To obtain the ideal of invariants in the original probability coordinates we apply the inverse Fourier transform. In this situation, this is the same as the Hadamard transform which appears frequently in the phylogenetics literature. Each Fourier coordinate gets replaced as follows:

$$ q_{g_1 \ldots g_{m+1}} \;=\; \sum_{i_1, \ldots, i_m \in \mathbb{Z}_2} (-1)^{i_1 g_1 + \cdots + i_m g_m}\, p_{i_1 \ldots i_m}. $$
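A direct, unoptimized evaluation of this transform can be sketched as follows. The function name `fourier_coordinates` is hypothetical, and the sign convention and the absence of a normalizing factor are assumptions of the sketch; a different normalization only rescales all coordinates and does not affect the homogeneous invariants.

```python
from itertools import product

def fourier_coordinates(p, m):
    """Hadamard transform of a distribution on m binary leaves.

    p : dict mapping m-tuples over {0, 1} to probabilities
    Returns a dict keyed by (g_1, ..., g_{m+1}), where the last index is
    determined by the condition g_1 + ... + g_{m+1} = 0 in Z2.
    """
    q = {}
    for g in product((0, 1), repeat=m):
        total = 0
        for i in product((0, 1), repeat=m):
            exponent = sum(a * b for a, b in zip(i, g))
            sign = -1 if exponent % 2 else 1
            total += sign * p.get(i, 0)
        q[g + (sum(g) % 2,)] = total
    return q

# The uniform distribution on two leaves.
uniform = {i: 0.25 for i in product((0, 1), repeat=2)}
q_uniform = fourier_coordinates(uniform, 2)
```

For the uniform distribution only the coordinate indexed by the all-zero sequence is nonzero.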



Example 30 (Snowflake). Consider the tree $T$ on five leaves pictured in Figure 2. After adding the extra edge at the root, we have the snowflake tree $T'$ with six leaves. Associated to each of the three interior edges $e_1$, $e_2$, and $e_3$ there are 56 invariants which are the $2 \times 2$ minors of two $2 \times 8$ matrices. For instance, associated to the edge $e_1$ we get the two $2 \times 8$ matrices

$$ M_0 = \begin{pmatrix} q_{000000} & q_{000011} & q_{000101} & q_{001001} & q_{000110} & q_{001010} & q_{001100} & q_{001111} \\ q_{110000} & q_{110011} & q_{110101} & q_{111001} & q_{110110} & q_{111010} & q_{111100} & q_{111111} \end{pmatrix} $$

$$ M_1 = \begin{pmatrix} q_{010001} & q_{010010} & q_{010100} & q_{011000} & q_{010111} & q_{011011} & q_{011101} & q_{011110} \\ q_{100001} & q_{100010} & q_{100100} & q_{101000} & q_{100111} & q_{101011} & q_{101101} & q_{101110} \end{pmatrix} $$

A probability distribution on five binary random variables comes from the Jukes-Cantor binary model if and only if the $2 \times 2$-minors of all of these six $2 \times 8$ matrices are zero. □
Figure 2: Adding an edge at the root produces a snowflake

6.3 Jukes-Cantor DNA model 

The Jukes-Cantor DNA model has transition matrices that look like

$$ \begin{pmatrix} b_v & a_v & a_v & a_v \\ a_v & b_v & a_v & a_v \\ a_v & a_v & b_v & a_v \\ a_v & a_v & a_v & b_v \end{pmatrix}. $$

Here it is not necessary to require $a_v + b_v = 1$. We can regard $(a_v : b_v)$ as homogeneous coordinates. This is a group based model for $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ with the Jukes-Cantor labeling function $L : G \to \{0, 1\}$ defined in Example 14. As was shown in ^7] for binary trees and uniform root distribution, the trees labeled by $L$ are precisely the subforests of $T$. There are $F_{2m-1}$ subforests in a binary tree with $m$ leaves, where $F_r$ is the $r$-th Fibonacci number. In total, for a tree with $m$ leaves, there are $3 \cdot 4^{m-1}$ linear invariants of the form $q_{g_1 \ldots g_m}$ where $g_1 + g_2 + \cdots + g_m \neq (0,0)$, and there are $4^{m-1} - F_{2m-1}$ linear invariants of the form $q_{g_1 \ldots g_m} - q_{h_1 \ldots h_m}$ where $L(g(e)) = L(h(e))$ for all $e$.
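The Fibonacci count can be checked by brute force on the two smallest cases. The sketch below enumerates the labelings induced by the Jukes-Cantor labeling function on the claw tree with three leaves and on the balanced tree with four leaves, both taken with uniform root distribution, so that the group elements at the leaves sum to zero. Function names, the tuple encoding of $\mathbb{Z}_2 \times \mathbb{Z}_2$ and the ordering of the edges in a label are assumptions of the sketch.

```python
from itertools import product

GROUP = [(0, 0), (0, 1), (1, 0), (1, 1)]  # Z2 x Z2

def add(*elements):
    """Sum of group elements in Z2 x Z2."""
    return (sum(g[0] for g in elements) % 2, sum(g[1] for g in elements) % 2)

def jc_label(g):
    """Jukes-Cantor labeling: 0 for the identity, 1 otherwise."""
    return 0 if g == (0, 0) else 1

def claw_labelings():
    """Labels (leaf 1, leaf 2, leaf 3) on the claw tree, with g1 + g2 + g3 = 0."""
    labels = set()
    for g1, g2 in product(GROUP, repeat=2):
        g3 = add(g1, g2)
        labels.add((jc_label(g1), jc_label(g2), jc_label(g3)))
    return labels

def quartet_labelings():
    """Labels (leaf 1, leaf 2, interior edge, leaf 3, leaf 4) on the 4-leaf tree."""
    labels = set()
    for g1, g2, g3 in product(GROUP, repeat=3):
        g4 = add(g1, g2, g3)          # leaves sum to zero
        interior = add(g1, g2)        # group element carried by the interior edge
        labels.add((jc_label(g1), jc_label(g2), jc_label(interior),
                    jc_label(g3), jc_label(g4)))
    return labels

def fibonacci(r):
    """r-th Fibonacci number with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(r - 1):
        a, b = b, a + b
    return a
```

With $m = 3$ and $m = 4$ the enumeration gives $F_5 = 5$ and $F_7 = 13$ labelings respectively.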

Now we will describe the higher degree invariants. According to Theorem 24 it suffices to understand the invariants which arise for the (unrooted) claw tree $K_{1,3}$. Modulo the linear invariants, there are only five unknowns. They correspond to the five subforests of $K_{1,3}$ and they are

$$ q_{000}, \quad q_{011}, \quad q_{101}, \quad q_{110}, \quad q_{111}. $$

The phylogenetic ideal for this claw tree is generated by a single cubic polynomial

$$ I_{K_{1,3},L} \;=\; \langle\, q_{000}\, q_{111}^2 - q_{011}\, q_{101}\, q_{110} \,\rangle. $$

From this cubic we can deduce the ideal of invariants $I_{T,L}$ provided $T$ is a binary tree. We express these invariants in the labeled coordinates $q_\lambda$ where $\lambda$ is a sequence in $\{0,1\}^{|E(T)|}$ which is a consistent labeling of the tree $T$ according to the Jukes-Cantor labeling function $L$. That is, the 1's correspond to the edges which appear in the corresponding subforest of $T$.

Figure 3: The unrooted balanced binary tree when the root distribution is uniform

First, we will describe the quadratic binomials $\mathrm{Quad}(e,T)$ associated to each edge $e$. Form the matrix $M_0$ whose entries are the unknowns $q_\lambda$ with $\lambda(e) = 0$. The matrix $M_0$ has $F_{2m_- - 3}$ rows and $F_{2m_+ - 3}$ columns where $m_-$ is the number of leaves of $T_{e,-}$ and $m_+$ is the number of leaves of $T_{e,+}$. The rows (resp. columns) of $M_0$ are indexed by the subforests of $T_{e,-}$ (resp. $T_{e,+}$) whose labeling on $e$ is 0. Similarly, the matrix $M_1$ is a $F_{2m_- - 2} \times F_{2m_+ - 2}$ matrix whose entries are the unknowns $q_\lambda$ with $\lambda(e) = 1$. The set $\mathrm{Quad}(e,T)$ consists of all the $2 \times 2$ minors of $M_0$ and $M_1$.

Now we will describe the cubic invariants associated to each interior vertex $v$ of the tree. Let the three edges emanating from $v$ be $e_1$, $e_2$ and $e_3$. Recall that $T_{v,e_i}$ is the subtree of $T$ which has $e_i$ as a leaf edge (see Figure 1). The set of consistent labels of $T_{v,e_i}$ that have label $l$ on edge $e_i$ is denoted $\mathrm{im}(L^{T_{v,e_i}}, l)$. This is just the set of subforests on $T_{v,e_i}$ which have edge label $l$ on the edge $e_i$. Then we need to take all the cubic polynomials derived from $q_{000} q_{111}^2 - q_{011} q_{101} q_{110}$ as follows:

$$ q_{L_1^1 L_1^2 L_1^3}\, q_{L_2^1 L_2^2 L_2^3}\, q_{L_3^1 L_3^2 L_3^3} \;-\; q_{L_1^1 L_2^2 L_2^3}\, q_{L_2^1 L_1^2 L_3^3}\, q_{L_3^1 L_3^2 L_1^3} \tag{20} $$

where $L_1^i \in \mathrm{im}(L^{T_{v,e_i}}, 0)$ and $L_2^i, L_3^i \in \mathrm{im}(L^{T_{v,e_i}}, 1)$, for all $i$. Now we will illustrate how to apply these constructions on a small example.

Example 31 (Balanced Binary Tree). Let $T$ be the balanced binary tree with four leaves. See Figure 3. We assume that the root distribution is uniform. Modulo the linear invariants there are $F_7 = 13$ indeterminates given by the 13 subforests of the binary tree with four leaves:

$$ q_{00000},\ q_{11000},\ q_{00011},\ q_{11011},\ q_{10110},\ q_{10101},\ q_{01110},\ q_{01101},\ q_{11110},\ q_{10111},\ q_{01111},\ q_{11101},\ q_{11111}. $$

The first two indices in the label correspond to the left-most leaves, the last two indices correspond to the right-most leaves and the middle index is the interior edge. The matrices $M_0$ and $M_1$ associated to the interior edge are respectively the $F_3 \times F_3$ and $F_4 \times F_4$ matrices

$$ M_0 = \begin{pmatrix} q_{00000} & q_{00011} \\ q_{11000} & q_{11011} \end{pmatrix}, \qquad M_1 = \begin{pmatrix} q_{10110} & q_{10101} & q_{10111} \\ q_{01110} & q_{01101} & q_{01111} \\ q_{11110} & q_{11101} & q_{11111} \end{pmatrix}. $$

The invariants $\mathrm{Quad}(e,T)$ are the $2 \times 2$ minors of these matrices. Among the cubic invariants associated to the left interior vertex is the binomial $q_{00011}\, q_{11110}\, q_{11101} - q_{01110}\, q_{10101}\, q_{11011}$. □

Constructing the invariants when the root distribution is allowed to be arbitrary amounts to changing the labeling function associated to the root distribution to the identity labeling. There
are 11 labeled Fourier indeterminates modulo the linear invariants. A direct computation using the 
software 4ti2 shows that the ideal of invariants is generated by 9 quadrics and 6 cubics. 

6.4 Kimura 3-parameter model 

The Kimura 3-parameter model has transition matrices that look like

$$ \begin{pmatrix} d_v & a_v & b_v & c_v \\ a_v & d_v & c_v & b_v \\ b_v & c_v & d_v & a_v \\ c_v & b_v & a_v & d_v \end{pmatrix}. $$

Here the labeling $L$ is the identity function on $\mathbb{Z}_2 \times \mathbb{Z}_2 = \{A, G, C, T\}$. The labeling being injective, there are no linear invariants. We add a root edge to get a tree with one more leaf. We first form the set of quadrics $\mathrm{Quad}(e,T)$ for each interior edge $e$. They are the $2 \times 2$-minors of four matrices $M_A$, $M_G$, $M_C$, $M_T$, one for each of the nucleotides which may appear in the labeling on that edge. The next step is to determine the set of local binomials $\mathrm{Ext}(\mathcal{B}_v \to T)$ for any interior vertex $v$. The ingredients for this are the 16 cubics and the 18 quartics displayed in Example 25. These 34 binomials form a Gröbner basis for the ideal of the claw tree $K_{1,3}$, and to each of them we apply the extension procedure of Lemma 22. Adding the resulting large collection of cubics and quartics to the previous $2 \times 2$-minors gives generators for the ideal of the Kimura 3-parameter model.

6.5 Kimura 2-parameter model 

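The Kimura 2-parameter model has transition matrices that look like

$$ \begin{pmatrix} c_v & a_v & b_v & b_v \\ a_v & c_v & b_v & b_v \\ b_v & b_v & c_v & a_v \\ b_v & b_v & a_v & c_v \end{pmatrix}. $$

Here the group is also $\mathbb{Z}_2 \times \mathbb{Z}_2 = \{A, G, C, T\}$, but the labeling function is not injective. It is

$$ L : \mathbb{Z}_2 \times \mathbb{Z}_2 \to \{0, 1, 2\} \quad\text{with}\quad L(A) = 0,\ L(G) = 1 \ \text{and}\ L(C) = L(T) = 2. $$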

See Example llfll Finding the set im(L^) of consistent labelings on a binary tree T is a combinatorial 
problem which we will not address here. (What is the analogue to the Fibonacci numbers ?) 
Assuming this has been accomplished and the precise list of indeterminates $q_\lambda$ is known, then the description of the set of quadrics $\mathrm{Quad}(e,T)$ associated with an interior edge $e$ is just like before. They are the $2 \times 2$-minors of three matrices $M_0$, $M_1$, $M_2$ whose entries are the unknowns $q_\lambda$.
In light of Theorem 24 the remaining task is to understand the ideal of invariants for the claw tree $K_{1,3}$. Returning to the setup at the beginning of Section 3, this is an ideal in the 16 Fourier coordinates $q_{g_1 g_2 g_3}$ where $g_1 + g_2 = g_3$. There are six linear invariants for this tree, which correspond to pairs of triples $(g_1, g_2, g_3)$ and $(h_1, h_2, h_3)$ such that $L(g_i) = L(h_i)$ for all $i$. Modulo the linear invariants, the polynomial ring has ten indeterminates $q_\lambda$. These are

$$ q_{000},\ q_{011},\ q_{022},\ q_{101},\ q_{110},\ q_{122},\ q_{202},\ q_{212},\ q_{220},\ q_{221}. $$

The ideal of invariants for the claw tree $K_{1,3}$ modulo the linear invariants has a Gröbner basis consisting of six cubics and three quartics. For example, the following two binomials appear in it:

$$ q_{022}\, q_{101}\, q_{220} - q_{000}\, q_{122}\, q_{221} \quad\text{and}\quad q_{022}^2\, q_{101}\, q_{110} - q_{000}\, q_{011}\, q_{122}^2. $$

Theorem 24 tells us how to construct the invariants for any binary tree from these local data.
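The list of ten indeterminates and the two sample binomials can be double-checked with a few lines of code. The sketch below applies the Kimura 2-parameter labeling to all triples $(g_1, g_2, g_1 + g_2)$ and tests a binomial by comparing the multisets of labels in each of the three columns of its tableaux. The names and the encoding of the group are assumptions of the sketch; the test itself is the usual criterion for membership in a toric ideal of this kind.

```python
from collections import Counter
from itertools import product

ELEMENTS = {"A": (0, 0), "G": (0, 1), "C": (1, 0), "T": (1, 1)}
NAMES = {value: name for name, value in ELEMENTS.items()}
K2P_LABEL = {"A": 0, "G": 1, "C": 2, "T": 2}

def add(g, h):
    """Group addition in Z2 x Z2, on nucleotide names."""
    a, b = ELEMENTS[g], ELEMENTS[h]
    return NAMES[((a[0] + b[0]) % 2, (a[1] + b[1]) % 2)]

def claw_labels():
    """Distinct labels (L(g1), L(g2), L(g1 + g2)) on the claw tree."""
    return sorted({(K2P_LABEL[g], K2P_LABEL[h], K2P_LABEL[add(g, h)])
                   for g, h in product("AGCT", repeat=2)})

def same_columns(lhs, rhs):
    """True if both tableaux carry the same multiset of labels in every column."""
    return [Counter(c) for c in zip(*lhs)] == [Counter(c) for c in zip(*rhs)]

cubic = same_columns([(0, 2, 2), (1, 0, 1), (2, 2, 0)],
                     [(0, 0, 0), (1, 2, 2), (2, 2, 1)])
quartic = same_columns([(0, 2, 2), (0, 2, 2), (1, 0, 1), (1, 1, 0)],
                       [(0, 0, 0), (0, 1, 1), (1, 2, 2), (1, 2, 2)])
```

Note that the quartic needs the squares: the same index pattern without repeated rows fails the column test.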

7 Algebraically Independent Invariants Are Not Enough 

Each algebraic variety $X$ we have studied in this paper lives in an ambient space of $k^m$ dimensions, where $m$ is the number of leaves of the given tree and $k$ is the number of states of each random variable. The coordinates of the ambient space are the probabilities $p_{i_1 i_2 \cdots i_m}$, or their Fourier transforms $q_{i_1 i_2 \cdots i_m}$. The dimension of the model $X$ is the number of free model parameters, and the codimension of the model $X$ is

$$ \mathrm{codim}(X) \;=\; k^m - \dim(X). $$

This is the number of local equations needed to describe the variety $X$ at a smooth point [10].
However, in general, the number of equations needed to describe X at a singular point, or the 
number of equations needed to define a variety X globally, can be much larger than the codimension. 

Several research articles on phylogenetic invariants give the impression that to characterize a 
model X, it suffices to take only codim(X) polynomial invariants, and some authors raised the 
question whether there is a complete list of algebraically independent invariants. We wish to argue 
that, both from the perspective of algebraic geometry and from the perspective of computational 
biology, it is misleading and wrong to ask for a set of only codim(X) polynomial invariants. 

Most models in algebraic statistics, including the group-based evolutionary models treated here, 
are not complete intersections, i.e., these models require more polynomial equations than their 
codimension. This holds even if one is only interested in strictly positive probability distributions. 
In the opinion of the authors, a given system of polynomial invariants for an evolutionary model 
X cannot be considered "complete" unless it actually generates the prime ideal of X. 

We illustrate this issue for the case when $X$ is the Jukes-Cantor binary model (hence $k = 2$) on
the fully balanced binary tree with m = 4 leaves. The parametric representation for this model was 
given by (jlj. The variety X has codimension 8. The homogeneous prime ideal of the model is given 
by the 2 x 2- minors of the four 2 x 4-matrices in @. This ideal requires 20 minimal generators. 
Can we replace these 20 quadrics by a smaller subset? Don't eight suffice? 
The answer is clearly "no" when $X$ is the complex variety defined by requiring that these matrices have rank one. However, more than eight equations are needed even if we consider a small neighborhood of the centroid of the probability simplex. This centroid is the uniform distribution on the leaf colorations. In Fourier coordinates, this neighborhood is given by setting $q_{0000} = 1$ and by assuming that the other 15 coordinates $q_{ijkl}$ are real numbers of small absolute value.

If we add the trivial invariant $q_{0000} - 1$ to our 20 quadrics, then the resulting ideal in the polynomial ring in 15 unknowns still has codimension 8 but it is now minimally generated by ten equations. The first five of these ten equations express five of the unknowns in terms of the others:

$$ q_{1110} - q_{1100}\, q_{0010}, \quad q_{1111} - q_{1100}\, q_{0011}, \quad q_{1101} - q_{1100}\, q_{0001}, \quad q_{0111} - q_{0011}\, q_{0100}, \quad q_{1011} - q_{0011}\, q_{1000}. $$
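These five relations can be tested on points of the model. The sketch below assumes the standard monomial parametrization in Fourier coordinates for the balanced four-leaf tree with a root edge: one parameter vector for each leaf edge, one for each of the two cherry edges, and one for the root edge, each indexed by the sum of the leaf labels below the edge. The function name and the integer test values are choices of the sketch. Setting the first entry of every parameter vector to 1 enforces $q_{0000} = 1$.

```python
from itertools import product

def fourier_point(a1, a2, a3, a4, e12, e34, r):
    """Fourier coordinates q[(g1, g2, g3, g4)] of a model point.

    Each argument is a pair (value at 0, value at 1). Assumed parametrization:
    q = a1[g1] a2[g2] a3[g3] a4[g4] e12[g1+g2] e34[g3+g4] r[g1+g2+g3+g4].
    """
    q = {}
    for g in product((0, 1), repeat=4):
        g1, g2, g3, g4 = g
        q[g] = (a1[g1] * a2[g2] * a3[g3] * a4[g4]
                * e12[g1 ^ g2] * e34[g3 ^ g4] * r[g1 ^ g2 ^ g3 ^ g4])
    return q

# Integer parameters keep the arithmetic exact.
q = fourier_point((1, 2), (1, 3), (1, 5), (1, 7), (1, 11), (1, 13), (1, 17))

relations_hold = all([
    q[(1, 1, 1, 0)] == q[(1, 1, 0, 0)] * q[(0, 0, 1, 0)],
    q[(1, 1, 1, 1)] == q[(1, 1, 0, 0)] * q[(0, 0, 1, 1)],
    q[(1, 1, 0, 1)] == q[(1, 1, 0, 0)] * q[(0, 0, 0, 1)],
    q[(0, 1, 1, 1)] == q[(0, 0, 1, 1)] * q[(0, 1, 0, 0)],
    q[(1, 0, 1, 1)] == q[(0, 0, 1, 1)] * q[(1, 0, 0, 0)],
])
```

By contrast, $q_{1001}$ is not the product $q_{1000}\, q_{0001}$ on a general point of the model: the two sides differ by the square of the root parameter.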

What remains is an ideal of codimension 3 which is minimally generated by five quadrics. The five quadrics are the five $2 \times 2$-minors not involving the upper left corner in the following matrix:

$$ \begin{pmatrix} \cdot & q_{0010} & q_{0001} \\ q_{0100} & q_{0110} & q_{0101} \\ q_{1000} & q_{1010} & q_{1001} \end{pmatrix} $$

If we remove any of these five quadrics then the zero set of the remaining four equations contains 
points which are not in the model, even in a neighborhood of the uniform distribution. For example, 
we get extraneous solutions by placing small positive reals eijki in the matrices 

/ • 

and eoioo 
\eiooo 

Notice that matrices with these entries are near the centroid of the probability simplex and satisfy 
all but one of the five 2 x 2-minors of the matrix. Thus we need all five quadrics to define our 
variety, even set-theoretically, and even locally around the uniform distribution. We regard the 
determinantal formula © as the best representation of the ideal of phylogenetic invariants. 

The failure to describe a phylogenetic model X set-theoretically becomes much more dramatic 
if we replace the ideal generators derived in this paper with the canonical invariants introduced by 
Szekely, Steel and Erdos |2i]|. The number of canonical invariants is always equal to the codimension 
of X, but, as we have argued, this means that they are far from having the correct zero set. For 
the specific Jukes-Cantor binary model with m = 4 discussed above, there are eight canonical 
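invariants. From [20, Theorem 10], we see that they are the following binomials of degree eight:

$q_{0000}q_{0010}q_{0100}q_{0110}q_{1001}q_{1011}q_{1101}q_{1111} - q_{0001}q_{0011}q_{0101}q_{0111}q_{1000}q_{1010}q_{1100}q_{1110}$,

$q_{0000}q_{0010}q_{0101}q_{0111}q_{1100}q_{1110}q_{1011}q_{1001} - q_{0001}q_{0011}q_{0100}q_{0110}q_{1111}q_{1101}q_{1000}q_{1010}$,

$q_{0000}q_{0010}q_{0111}q_{0101}q_{1111}q_{1101}q_{1000}q_{1010} - q_{0001}q_{0011}q_{0100}q_{0110}q_{1100}q_{1110}q_{1011}q_{1001}$,

$q_{0000}q_{0001}q_{0100}q_{0101}q_{1111}q_{1110}q_{1011}q_{1010} - q_{0011}q_{0010}q_{0111}q_{0110}q_{1100}q_{1101}q_{1000}q_{1001}$,

$q_{0000}q_{0001}q_{0111}q_{0110}q_{1111}q_{1110}q_{1000}q_{1001} - q_{0011}q_{0010}q_{0100}q_{0101}q_{1100}q_{1101}q_{1011}q_{1010}$,

$q_{0000}q_{0001}q_{0111}q_{0110}q_{1100}q_{1101}q_{1011}q_{1010} - q_{0011}q_{0010}q_{0100}q_{0101}q_{1111}q_{1110}q_{1000}q_{1001}$,

$q_{0000}q_{0011}q_{0110}q_{0101}q_{1110}q_{1101}q_{1000}q_{1011} - q_{0010}q_{0001}q_{0100}q_{0111}q_{1100}q_{1111}q_{1010}q_{1001}$,

$q_{0000}q_{0011}q_{0100}q_{0111}q_{1110}q_{1101}q_{1010}q_{1001} - q_{0010}q_{0001}q_{0110}q_{0101}q_{1100}q_{1111}q_{1000}q_{1011}$.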
The zero set of these equations has codimension three (!), and has many irreducible components.
The structure of the primary decomposition of the ideal of canonical invariants is very complicated. 
For instance, among the irreducible components, there are 48 linear spaces of codimension 3, e.g. 

$$q_{1111} = q_{1100} = q_{1110} = 0.$$

Among all the probability distributions which satisfy the eight canonical invariants listed above, the 
distributions which come from the Jukes-Cantor model represent a subset that has measure zero 
(codimension 8 inside codimension 3). For practical applications, this implies that an empirical 
distribution which is near to the solution set of the canonical equations cannot be trusted to come 
from the model. Although the canonical invariants define the model locally almost everywhere on 
the model distributions, they do not define the model globally in the entire probability simplex. 
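
The gap between the canonical invariants and the model can be seen on a single point. The sketch below is illustrative only: the index strings in `CANONICAL` are a transcription of the eight binomials above and should be checked against the source, and `model_point` uses an assumed seven-edge tree (four leaf edges, cherries {1,2} and {3,4}, and a fifth leaf labelled by the sum of the others) that is inferred from the index pattern rather than stated here. It checks that all eight binomials vanish on a model point, that they also vanish on the linear space $q_{1111} = q_{1100} = q_{1110} = 0$, and that a generic point of that linear space violates one of the quadratic model equations.

```python
from fractions import Fraction
from itertools import product
import random

# Transcribed (left monomial, right monomial) pairs; an assumption to verify.
CANONICAL = [
    ("0000 0010 0100 0110 1001 1011 1101 1111", "0001 0011 0101 0111 1000 1010 1100 1110"),
    ("0000 0010 0101 0111 1100 1110 1011 1001", "0001 0011 0100 0110 1111 1101 1000 1010"),
    ("0000 0010 0111 0101 1111 1101 1000 1010", "0001 0011 0100 0110 1100 1110 1011 1001"),
    ("0000 0001 0100 0101 1111 1110 1011 1010", "0011 0010 0111 0110 1100 1101 1000 1001"),
    ("0000 0001 0111 0110 1111 1110 1000 1001", "0011 0010 0100 0101 1100 1101 1011 1010"),
    ("0000 0001 0111 0110 1100 1101 1011 1010", "0011 0010 0100 0101 1111 1110 1000 1001"),
    ("0000 0011 0110 0101 1110 1101 1000 1011", "0010 0001 0100 0111 1100 1111 1010 1001"),
    ("0000 0011 0100 0111 1110 1101 1010 1001", "0010 0001 0110 0101 1100 1111 1000 1011"),
]

def monomial(q, word):
    """Product of the coordinates named in a space-separated index string."""
    value = Fraction(1)
    for index in word.split():
        value *= q[index]
    return value

def canonical_values(q):
    """Values of the eight canonical binomials at the point q."""
    return [monomial(q, left) - monomial(q, right) for left, right in CANONICAL]

def model_point(theta):
    """Point of the ASSUMED seven-edge parametrization (see the lead-in)."""
    q = {}
    for g in product((0, 1), repeat=4):
        g1, g2, g3, g4 = g
        labels = (g1, g2, g3, g4,
                  (g1 + g2) % 2, (g3 + g4) % 2, (g1 + g2 + g3 + g4) % 2)
        value = Fraction(1)
        for t, label in zip(theta, labels):
            if label:
                value *= t
        q["".join(map(str, g))] = value
    return q

rng = random.Random(0)
on_model = model_point([Fraction(rng.randint(1, 9), 10) for _ in range(7)])

# A point with positive coordinates except on the linear space from the text.
spurious = {"".join(map(str, g)): Fraction(rng.randint(1, 9), 10)
            for g in product((0, 1), repeat=4)}
for index in ("1111", "1100", "1110"):
    spurious[index] = Fraction(0)

print(canonical_values(on_model))   # eight exact zeros
print(canonical_values(spurious))   # eight exact zeros as well
print(spurious["1101"] - spurious["1100"] * spurious["0001"])  # nonzero
```

Each binomial vanishes on the linear space because each of its two monomials contains at least one of the three coordinates that are set to zero.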

The canonical equations correspond to a lattice basis for the toric ideal of phylogenetic
invariants. It follows from general theory in commutative algebra that the toric ideal can be computed from
the canonical equations by the process of saturation (as described in ^| Algorithm 12.3]), but 
this is a non-trivial and time-consuming computation. What we have accomplished in this paper 
is an explicit description of a list of phylogenetic invariants which minimally generates the toric 
ideals of interest. This implies that globally (in the probability simplex, in real space, or in complex space) the
only points which satisfy all the invariants come from the model. However, in all cases (with the 
exception of a few trivial ones), the number of our polynomial invariants is considerably larger than 
the codimension of the model, a feature which is unavoidable in algebraic geometry. 
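
The difference between a lattice basis ideal and the toric ideal obtained from it by saturation already shows up in a much smaller example. The sketch below is a toy illustration that does not come from this paper: it uses the twisted cubic, whose toric ideal is generated by $xz - y^2$, $yw - z^2$ and $xw - yz$, while the two binomials `f1` and `f2` form a lattice basis. The third generator `g` is not in the ideal of `f1` and `f2`, but $y \cdot g$ is, which is exactly what saturation with respect to the variables detects.

```python
from fractions import Fraction
import random

def f1(x, y, z, w):
    return x * z - y * y   # lattice vector (1, -2, 1, 0)

def f2(x, y, z, w):
    return y * w - z * z   # lattice vector (0, 1, -2, 1)

def g(x, y, z, w):
    return x * w - y * z   # in the toric ideal, missing from the lattice basis

# Spot check of the identity y*g = x*f2 + z*f1 (it also follows by expanding).
rng = random.Random(1)
point = [Fraction(rng.randint(-9, 9)) for _ in range(4)]
x, y, z, w = point
identity_holds = (y * g(*point) == x * f2(*point) + z * f1(*point))

# A point of the toric variety: (s^3, s^2 t, s t^2, t^3).
s, t = Fraction(2), Fraction(3)
toric_point = (s**3, s**2 * t, s * t**2, t**3)

# A point on the extra component {y = z = 0} of the lattice basis ideal.
extra_point = (Fraction(1), Fraction(0), Fraction(0), Fraction(1))

print(identity_holds)
print(f1(*toric_point), f2(*toric_point), g(*toric_point))
print(f1(*extra_point), f2(*extra_point), g(*extra_point))
```

The extra point satisfies both lattice basis equations but not `g`, so the two binomials alone cut out more than the toric variety, in the same way that the canonical invariants cut out more than the model.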

There is another important motivation, coming directly from computational biology, for our 
representation of the phylogenetic invariants. Evolutionary models have to allow for the possibility 
of heterogeneous rates as described in [71|H]. For instance, in the evolution of DNA sequences, one
may wish to model two different rates: one for genes and one for non-genes. This replaces our given 
parameterization ((T)) by the superposition of two evolutionary models of the same kind: 

$$\cdots \prod_{v \in V(T) \setminus \{r\}} \cdots \;+\; \cdots \prod_{v \in V(T) \setminus \{r\}} \cdots$$

In statistics, this corresponds to introducing a hidden binary variable. In geometry, we are passing 
to the secant variety (see jHI §7]). Our determinantal presentation of the invariants Quad(e,T) 
makes it easy to derive some invariants for models with heterogeneous rates. For instance, the 
cubic invariant discovered in is nothing but the determinant of the 3 x 3-matrix in H2U() . 
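
The passage from rank conditions to invariants of mixtures rests on a standard fact of linear algebra: a mixture of two matrices of rank one has rank at most two, so its $3 \times 3$ determinant vanishes even though its $2 \times 2$ minors generally do not. The sketch below illustrates this with generic rank-one matrices; the vectors and the mixing weight are arbitrary choices, and the matrices are not the specific matrix of Fourier coordinates referred to in the text.

```python
from fractions import Fraction

def outer(u, v):
    """Rank-one matrix u v^T."""
    return [[a * b for b in v] for a in u]

def mix(lam, a, b):
    """Entrywise mixture lam * a + (1 - lam) * b."""
    return [[lam * x + (1 - lam) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def minor2(m, rows, cols):
    """2 x 2 minor of m on the given pair of rows and pair of columns."""
    (r1, r2), (c1, c2) = rows, cols
    return m[r1][c1] * m[r2][c2] - m[r1][c2] * m[r2][c1]

def det3(m):
    """Determinant of a 3 x 3 matrix by expansion along the first row."""
    return (m[0][0] * minor2(m, (1, 2), (1, 2))
            - m[0][1] * minor2(m, (1, 2), (0, 2))
            + m[0][2] * minor2(m, (1, 2), (0, 1)))

first = outer([1, 2, 3], [1, 1, 2])    # one rank-one matrix
second = outer([2, 1, 1], [1, 3, 1])   # another rank-one matrix
mixture = mix(Fraction(1, 3), first, second)

print(minor2(first, (0, 1), (0, 1)))     # 0: rank one
print(minor2(mixture, (0, 1), (0, 1)))   # nonzero: the quadric fails on the mixture
print(det3(mixture))                     # 0: the cubic survives
```

In geometric terms, the $2 \times 2$ minors define a variety of rank-one matrices, and the determinant is an equation of its secant variety.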

8 Conclusion 

This paper gives a solution to the longstanding problem of finding all phylogenetic invariants for 
the statistical models of evolution which have a group structure. We found explicit Gröbner bases
for the ideals of the Jukes-Cantor and Kimura models for DNA sequences. This was accomplished 
by developing a general machinery for building invariants from the local features of a tree and 
extending them to the entire tree. There are, however, many questions of a practical nature which 
remain. The main issue is how to use invariants to recover the phylogeny of a collection of taxa. 
First and foremost is the question of what statistical significance should be attached to the
numerical values that are obtained by evaluating the phylogenetic invariants at sample data. In- 
tuitively, if the data come from the model associated to a particular tree, the evaluation of an 
invariant polynomial should be small. How should this intuitive understanding be applied in prac- 
tice? This is really a general open problem associated with the polynomial functions that vanish on 
any statistical model. The point of working with these polynomial invariants is that they should 
eliminate the potentially difficult problem of approximating solutions to the maximum likelihood 
equations. However, most statistical tests (e.g. $\chi^2$, $G^2$) depend on comparing the empirical
distribution to the maximum likelihood estimates. The fundamental open problem we wish to pose to
statisticians is to develop statistical tests for deciding whether or not the data fit a given model
based solely on the evaluation of the polynomials which vanish on the model distributions.
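
To make the intuition concrete, the following sketch evaluates two quadratic invariants on simulated data. It is a toy setup chosen for brevity, not the model analysed above: a binary symmetric model on the four-leaf tree with split 12|34, a uniform root, and a hypothetical flip probability of 0.1 on every edge. The Fourier coordinates of the empirical distribution are computed as $q_g = \sum_x (-1)^{g \cdot x} \hat{p}_x$, and `split_invariant` evaluates $q_{0000} q_{g+h} - q_g q_h$. All function names are illustrative.

```python
import random

def simulate_quartet(n, flip, rng):
    """Empirical distribution of n leaf patterns from the tree 12|34."""
    counts = {}
    for _ in range(n):
        left = rng.randint(0, 1)                       # node joining leaves 1, 2
        right = left ^ int(rng.random() < flip)        # across the internal edge
        nodes = (left, left, right, right)
        leaves = tuple(node ^ int(rng.random() < flip) for node in nodes)
        counts[leaves] = counts.get(leaves, 0) + 1
    return {pattern: count / n for pattern, count in counts.items()}

def fourier(p, g):
    """Fourier coordinate q_g of a distribution p on binary patterns."""
    total = 0.0
    for pattern, prob in p.items():
        parity = sum(a * b for a, b in zip(g, pattern)) % 2
        total += -prob if parity else prob
    return total

def split_invariant(p, g, h):
    """Value of q_0000 * q_(g+h) - q_g * q_h at the distribution p."""
    gh = tuple((a + b) % 2 for a, b in zip(g, h))
    zero = (0, 0, 0, 0)
    return fourier(p, zero) * fourier(p, gh) - fourier(p, g) * fourier(p, h)

rng = random.Random(2024)
p_hat = simulate_quartet(50000, 0.1, rng)

# Invariant of the generating tree 12|34 versus the competing tree 13|24.
true_split = split_invariant(p_hat, (1, 1, 0, 0), (0, 0, 1, 1))
wrong_split = split_invariant(p_hat, (1, 0, 1, 0), (0, 1, 0, 1))
print(true_split, wrong_split)
```

On the exact model distribution the first value is zero and the second equals $\theta^4 (1 - \theta^2)$ with $\theta = 1 - 2 \cdot 0.1 = 0.8$, that is, about 0.147. On sample data the first value is only approximately zero, and deciding how small is small enough is precisely the open problem posed above.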

Even if the statistical issues in the previous paragraph can be resolved, before we can start 
implementing a phylogeny recovery method based on algebraic invariants, the help of computer 
scientists is needed to address the following challenging complexity question: How can we evaluate 
exponentially many polynomials in exponentially many indeterminates for exponentially many trees? 
The structural results about phylogenetic invariants derived in this paper should help. For instance, 
the techniques of Section 4 will allow one to hunt for local features of the tree (e.g. 2- or 3-splits of
the leaves) and assemble the tree piece by piece. Furthermore, our results show that all quadratic 
phylogenetic invariants are rank conditions on matrices associated to the splits of the tree, so they 
can be interpreted as conditional independence statements in the sense of graphical models. These 
invariants are clearly well-suited for the development of highly efficient algorithms. 

Finally, now that we have explicit Gröbner bases for the phylogenetic invariants of a group-based
model, there remains the problem of determining how good invariant-based methods are
at recovering phylogenies in problems of interest to biologists. Implementation and testing of 
invariant-based methods should be an expanding area of future research, based on the work in this 
paper and the results of Allman and Rhodes |^ |5] for the general Markov model. 